requested. 



Attached hereto is an appendix entitled "Version with Markings to Show Changes 
Made," which depicts the changes made to the instant application by the current 
amendment. 



Objections 

Applicants acknowledge the Examiner's remarks concerning the Declaration. 
Accompanying this Amendment and Response is a new Declaration executed by inventor 
Andrew Chan, which correctly identifies U.S. Patent Application Serial No. 08/819,013 as 
patented and U.S. Patent Application Serial No. 08/788,322 as abandoned. We are 
presently awaiting a newly executed declaration from the second inventor, Chong Fu. 

Applicants acknowledge the Examiner's objection to the incorporation by 
reference of subject matter deemed essential. The subject matter in question concerns the 
nature of high stringency conditions for hybridization, which is deemed essential because 
"high stringency conditions" is recited in the pending claims. Applicants point out that the 
pending claims have been cancelled, and that the new amended claims do not recite 
BLNK proteins encoded by nucleic acids that will hybridize under high stringency 
conditions to other nucleic acids. 

Applicants request withdrawal of the objections. 

Rejections Under 35 U.S.C. §101 

Claims 23-34 stand rejected under 35 U.S.C. §101 as lacking either a specific and 
substantial asserted utility or a well-established utility. The Examiner expresses that 
BLNK protein activity is not defined by the instant specification and that the utility 
asserted for BLNK protein by the instant specification is a general utility. Applicants 
traverse. 

Applicants draw the Examiner's attention to the revised U.S. PTO Utility 
Examination Guidelines published in the Federal Register, vol. 66, No. 4. At page 1098 of 
the identified volume of the Federal Register, at section B, 2, (2), the guidelines state: 

An applicant need only provide one credible assertion of specific and substantial 
utility for each claimed invention to satisfy the utility requirement. 

Further, at page 1096 in response to comment 19, the Commissioner cites Fujikawa v. 
Wattanasin, 93 F.3d 1559, 1562, 39 USPQ2d 1895, 1900 (Fed. Cir. 1996), which states: 

"[A] 'rigorous correlation' need not be shown in order to establish a practical 
utility; 'reasonable correlation' is sufficient." 



Further, at page 1098, section B, 4, the guidelines state: 

Office personnel are reminded that they must treat as true a statement of fact made 
by an applicant in relation to an asserted utility, unless countervailing evidence can 
be provided that shows that one of ordinary skill in the art would have a legitimate 
basis to doubt the credibility of such a statement. Similarly, Office personnel 
must accept an opinion from a qualified expert that is based upon relevant facts 
whose accuracy is not being questioned; it is improper to disregard the opinion 
solely because of a disagreement over the significance or meaning of the facts 
offered. 

The instant specification asserts a number of characteristics and functions for 
BLNK proteins that support the conclusion that the claimed BLNK protein compositions 
have specific, substantial utility. For example, the instant specification asserts at page 6, 
lines 11-15 that BLNK protein is tyrosine phosphorylated by Syk following B cell receptor 
activation, and at page 20, lines 6-8 that BLNK protein binds to Grb2, PLCγ, Nck and Vav, 
and regulates calcium levels and modulates cytoskeletal organization, and at page 19, 
lines 28-29 that BLNK protein is critical for B cell receptor mediated response and B cell 
function. Applicants submit that these statements are credible and should be accepted. 

The instant application also provides methods for using the claimed BLNK protein 
compositions, for example to screen for bioactive agents that are capable of modulating 
BLNK protein activity (page 23, lines 4-11). 

Applicants submit that the asserted function of BLNK protein is specific, as the 
asserted binding activities and B cell regulation activities disclosed in the instant 
application are not properties shared by all proteins. 

In addition, Applicants submit that the asserted utility of BLNK protein is 
substantial, as the ability to modulate B cell function and to identify bioactive agents 
therefor is clearly desirable. 

While Applicants submit that the assertions of the specification should be accepted 
without any further inquiry into their accuracy, Applicants further submit support for the 
accuracy of the assertions in the form of the enclosed Declaration under 1.132 (the 
declaration). In paragraph 5 of the declaration, the inventor Andrew Chan, Ph.D., 
representing one of ordinary skill in the art, declares that he would expect to be able to use 
the claimed BLNK protein compositions as provided for in the present application. 
Moreover, the declaration discusses data already of record which confirm the accuracy of 
the assertions made in the application. Specifically, the declaration shows that loss of 
BLNK gene function results in abnormal B cell function, supporting the assertion that 
BLNK protein is a modulator of B cell function and that the loss thereof results in a 
BLNK-mediated disorder, and that the claimed BLNK protein compositions have specific 
and substantial utility. 

Claims 23-34 have been cancelled without prejudice, disclaimer or admission. 
New Claims 35-38 are directed to BLNK proteins comprising an amino acid sequence 
having at least about 95% identity to SEQ ID NO:1. Claims 36-38 are further directed to 
BLNK proteins comprising SEQ ID NO:1, BLNK proteins which will bind to specified 
BLNK protein binding partners, and BLNK proteins which lack specific tyrosine 
phosphorylation sites as set forth in SEQ ID NO:1, respectively. 

Claims 39-41 are directed to BLNK proteins comprising an amino acid sequence 
having at least about 95% identity to the amino acid sequence encoded by SEQ ID NO:2. 
Claims 40 and 41 are further directed to BLNK proteins comprising an amino acid 
sequence encoded by SEQ ID NO:2, and BLNK proteins which will bind to specified 
BLNK protein binding partners, respectively. 

Claim 42 is directed to a pharmaceutical composition comprising the BLNK 
protein according to any one of Claims 35-41. 

Claim 43 is directed to an antibody that binds to the BLNK protein according to 
any one of Claims 35-41. 

Claims 44 and 45 are directed to methods for using BLNK proteins which BLNK 
proteins comprise an amino acid sequence having at least about 95% identity to the amino 
acid sequence set forth in SEQ ID NO:1 and will bind to Grb2, PLCγ, Vav, or Nck. 
Claim 44 is directed to methods for screening for a bioactive agent that will bind to a 
BLNK protein, while Claim 45 is directed to methods for screening for a bioactive agent 
capable of modulating BLNK protein activity. 

Applicants submit that the new claims satisfy the utility requirement of 35 U.S.C. 
§101 and request withdrawal of the rejection and allowance of the claims. 

Rejections Under 35 U.S.C. §112, First Paragraph - How to Use 

Claims 23-34 stand rejected under 35 U.S.C. §112, first paragraph as failing to 
teach the reasonably skilled artisan how to use the invention for a credible, specific and 
substantial utility. Applicants traverse. 

As discussed above and supported by the accompanying declaration, Applicants 
submit that new Claims 35-45 satisfy the utility requirement of 35 U.S.C. §101. 
Accordingly, Applicants submit that a person of reasonable skill in the art would be able 
to use the invention in the full scope of the claims for a credible, specific and substantial 
utility. 

Applicants request withdrawal of the rejection and allowance of the new claims. 

Rejections Under 35 U.S.C. §112, First Paragraph - Written Description 

Claims 23-34 stand rejected under 35 U.S.C. §112, first paragraph as lacking 
written description support in the specification. Applicants traverse. 

The Office Action expresses that Claims 23-34 are directed to a very large genus of 
recombinant polypeptide species, uses thereof, and antibodies that bind thereto. 

Claims 23-34 have been cancelled without prejudice, disclaimer or admission. 

Applicants have amended the claims to further define the scope of the claimed 
BLNK protein compositions. Claims 35-38 are directed to BLNK proteins comprising an 
amino acid sequence having at least about 95% identity to SEQ ID NO:1. Claims 36-38 
are further directed to BLNK proteins comprising SEQ ID NO:1, BLNK proteins which 
will bind to specified BLNK protein binding partners, and BLNK proteins which lack 
specific tyrosine phosphorylation sites as set forth in SEQ ID NO:1, respectively. 

Claims 39-41 are directed to BLNK proteins comprising an amino acid sequence 
having at least about 95% identity to the amino acid sequence encoded by SEQ ID NO:2. 
Claims 40 and 41 are further directed to BLNK proteins comprising an amino acid 
sequence encoded by SEQ ID NO:2, and BLNK proteins which will bind to specified 
BLNK protein binding partners, respectively. 

Claim 42 is directed to a pharmaceutical composition comprising the BLNK 
protein according to any one of Claims 35-41. 

Claim 43 is directed to an antibody that binds to the BLNK protein according to 
any one of Claims 35-41. 

Claims 44 and 45 are directed to methods for using BLNK proteins which BLNK 
proteins comprise an amino acid sequence having at least about 95% identity to the amino 
acid sequence set forth in SEQ ID NO:1 and will bind to Grb2, PLCγ, Vav, or Nck. 
Claim 44 is directed to methods for screening for a bioactive agent that will bind to a 
BLNK protein, while Claim 45 is directed to methods for screening for a bioactive agent 
capable of modulating BLNK protein activity. 

The Office Action expresses at page 5 that the instant specification does not 
provide sufficient teaching or guidance for one of reasonable skill in the art to determine 
sequences that are within the scope of 95% identity to SEQ ID NO:1 and SEQ ID NO:2. 
The Examiner asserts that because no specific algorithm is disclosed, the claims do not 
find written description support in the specification. Applicants disagree. 

Applicants point out that specific algorithms are disclosed in the instant 
specification. At page 24, lines 6-7, the instant application states: "All references cited 
herein are incorporated by reference." 

Further, at page 5, line 12, the specification cites the prior art of Altschul et al., J. 
Mol. Biol. 215:403-410, 1990 (Altschul-A) (a copy of which is attached as Exhibit A). 
Altschul-A describes the basic local alignment search tool (BLAST) and at page 404, left 
column, paragraph 3 discloses parameters for measuring nucleic acid similarity. 

Altschul-A also discloses the prior art of Altschul et al., J. Mol. Biol. 219:555-565, 
1991 (Altschul-B) (a copy of which is attached as Exhibit B). Altschul-B discloses 
the PAM-120 amino acid substitution matrix and scoring parameters therefor. The 
PAM-120 matrix and parameters are found at page 560, Table 4. 

Accordingly, Applicants submit that the present application does properly disclose 
sequence comparison algorithms. 

Applicants submit that one of ordinary skill in the art would clearly construe the 
meaning of 95% sequence identity based on the teaching of the specification and the 
knowledge held in the art, and would conclude that Applicants were in possession of the 
claimed BLNK protein compositions at the time of filing of the priority application. 

Applicants submit that Claims 35-45 satisfy the written description requirement of 
35 U.S.C. §112, first paragraph and request withdrawal of the rejection and allowance of 
the claims. 

Rejections Under 35 U.S.C. §112, Second Paragraph - Indefiniteness 

Claims 1-22, 23, 25, 27-28, and 31-34 stand rejected under 35 U.S.C. §112, second 
paragraph as being indefinite. In particular, Claims 23, 25, 27-28 and 31-34, under 
consideration in the case, are found indefinite for use of the following phrases: 

i) "polypeptide comprising the protein" (Claim 23); 

ii) "high stringency conditions" (Claim 25 and 28); and 

iii) "polypeptide" (Claims 27, 31-33). 

As a preliminary matter, Applicants point out that Claims 23-34 have been 
cancelled without prejudice, disclaimer or admission. 

New Claims 35-44 do not recite proteins encoded by nucleic acids that will 
hybridize under high stringency conditions. 

Applicants request withdrawal of the rejection and allowance of the new claims. 



Applicants submit that the application is now in form for allowance and early 
notification of such is requested. If there remain issues that the Examiner believes may be 
resolved by telephone, he/she is respectfully requested to contact the undersigned at (415) 
Basic Local Alignment Search Tool 



APPLICANT'S 
EXHIBIT 



Stephen F. Altschul 1, Warren Gish 1, Webb Miller 2, 
Eugene W. Myers 3 and David J. Lipman 1 

1 National Center for Biotechnology Information 
National Library of Medicine, National Institutes of Health 
Bethesda, MD 20894, U.S.A. 

2 Department of Computer Science 
Pennsylvania State University, University Park, PA 16802, U.S.A. 

3 Department of Computer Science 
University of Arizona, Tucson, AZ 85721, U.S.A. 

(Received 26 February 1990; accepted 15 May 1990) 

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), 
directly approximates alignments that optimize a measure of local similarity, the maximal 
segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP 
scores allow an analysis of the performance of this method as well as the statistical 
significance of alignments it generates. The basic algorithm is simple and robust; it can be 
implemented in a number of ways and applied in a variety of contexts including 
straightforward DNA and protein sequence database searches, motif searches, gene 
identification searches, and in the analysis of multiple regions of similarity in long DNA 
sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is 
an order of magnitude faster than existing sequence comparison tools of comparable 
sensitivity. 
1. Introduction 

The discovery of sequence homology to a known protein or family of proteins 
often provides the first clues about the function of a newly sequenced gene. 
As the DNA and amino acid sequence databases continue to grow in size they 
become increasingly useful in the analysis of newly sequenced genes and 
proteins because of the greater chance of finding such homologies. There are 
a number of software tools for searching sequence databases but all use 
some measure of similarity between sequences to distinguish biologically 
significant relationships from chance similarities. Perhaps the best studied 
measures are those used in conjunction with variations of the dynamic 
programming algorithm (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff 
& Kruskal, 1983; Waterman, 1984). These methods assign scores to 
insertions, deletions and replacements, and compute an alignment of two 
sequences that corresponds to the least costly set of such mutations. Such 
an alignment may be thought of as minimizing the evolutionary distance or 
maximizing the similarity between the two sequences compared. In either 
case, the cost of this alignment is a measure of similarity; the algorithm 
guarantees it is optimal, based on the given scores. Because of their 
computational requirements, dynamic programming algorithms are 
impractical for searching large databases without the use of a 
supercomputer (Gotoh & Tagashira, 1986) or other special purpose 
hardware (Coulson et al., 1987). 

Rapid heuristic algorithms that attempt to approximate the above methods 
have been developed (Waterman, 1984), allowing large databases to be 
searched on commonly available computers. In many heuristic methods the 
measure of similarity is not explicitly defined as a minimal cost set of 
mutations, but instead is implicit in the algorithm itself. For example, 
the FASTP program (Lipman & Pearson, 1985; Pearson & Lipman, 1988) 
first finds locally similar regions between two sequences based on 
identities but not gaps, and then rescores these regions using a measure 
of similarity between residues, such as a PAM matrix (Dayhoff et al., 
1978) which allows conservative replacements as well as identities to 
increment the similarity score. Despite their rather indirect 
approximation of minimal evolution measures, heuristic tools such as 
FASTP have been quite popular and have identified many distant but 
biologically significant relationships. 
In this paper we describe a new method, BLAST† (Basic Local Alignment 
Search Tool), which employs a measure based on well-defined mutation 
scores. It directly approximates the results that would be obtained by a 
dynamic programming algorithm for optimizing this measure. The method 
will detect weak but biologically significant sequence similarities, and 
is more than an order of magnitude faster than existing heuristic 
algorithms. 



2. Methods 

(a) The maximal segment pair measure 

Sequence similarity measures generally can be classified as either global 
or local. Global similarity algorithms optimize the overall alignment of two 
sequences, which may include large stretches of low similarity (Needleman 
& Wunsch, 1970). Local similarity algorithms seek only relatively conserved 
subsequences, and a single comparison may yield several distinct subsequence 
alignments; unconserved regions do not contribute to the measure of 
similarity (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). 
Local similarity measures are generally preferred for database searches, 
where cDNAs may be compared with partially sequenced genes, and where 
distantly related proteins may share only isolated regions of similarity, 
e.g. in the vicinity of an active site. 

Many similarity measures, including the one we employ, begin with a matrix 
of similarity scores for all possible pairs of residues. Identities and 
conservative replacements have positive scores, while unlikely replacements 
have negative scores. For amino acid sequence comparisons we generally use 
the PAM-120 matrix (a variation of that of Dayhoff et al., 1978), while for 
DNA sequence comparisons we score identities +5, and mismatches -4; other 
scores are of course possible. A sequence segment is a contiguous stretch of 
residues of any length, and the similarity score for two aligned segments of 
the same length is the sum of the similarity values for each pair of aligned 
residues. 

Given these rules, we define a maximal segment pair (MSP) to be the highest 
scoring pair of identical length segments chosen from 2 sequences. The 
boundaries of an MSP are chosen to maximize its score, so an MSP may be of 
any length. The MSP score, which BLAST heuristically attempts to calculate, 
provides a measure of local similarity for any pair of sequences. A molecular 
biologist, however, may be interested in all conserved regions shared by 2 
proteins, not only in their highest scoring pair. We therefore define a 
segment pair to be locally maximal if its score cannot be improved either by 
extending or by shortening both segments (Sellers, 1984). BLAST can seek all 
locally maximal segment pairs with scores above some cutoff. 
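The MSP definition above can be illustrated with a short sketch. This is not the BLAST heuristic and not the authors' code; it is a minimal exhaustive calculation, assuming the DNA scores quoted in the text (+5 for an identity, -4 for a mismatch) and a hypothetical function name `msp_score`. It tracks, for every pair of end positions, the best score of an ungapped segment pair ending there.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Score of the maximal segment pair (ungapped) between sequences a and b.

    Runs in time proportional to len(a) * len(b). An empty segment pair
    scores 0, so the result is never negative.
    """
    best = 0
    # prev[j] holds the best score of a segment pair ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Either extend the segment pair on this diagonal or start afresh.
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # Two identities, one mismatch, two identities on the main diagonal.
    print(msp_score("AAGAA", "AACAA"))
```

Because no gaps are allowed, each cell depends only on its diagonal neighbour, which is what distinguishes this from a gapped local alignment.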

Like many other similarity measures, the MSP score for 2 sequences may be 
computed in time proportional to the product of their lengths using a simple 
dynamic programming algorithm. An important advantage of the MSP measure is 
that recent mathematical results allow the statistical significance of MSP 
scores to be estimated under an appropriate random sequence model (Karlin & 
Altschul, 1990; Karlin et al., 1990). Furthermore, for any particular scoring 
matrix (e.g. PAM-120) one can estimate the frequencies of paired residues in 
maximal segments. This tractability to mathematical analysis is a crucial 
feature of the BLAST algorithm. 

† Abbreviations used: BLAST, blast local alignment search tool; MSP, maximal 
segment pair; bp, base-pair(s). 

(b) Rapid approximation of MSP scores 

In searching a database of thousands of sequences, generally only a handful, 
if any, will be homologous to the query sequence. The scientist is therefore 
interested in identifying only those sequence entries with MSP scores over 
some cutoff score S. These sequences include those sharing highly significant 
similarity with the query as well as some sequences with borderline scores. 
This latter set of sequences may include high scoring random matches as well 
as sequences distantly related to the query. The biological significance of 
the high scoring sequences may be inferred almost solely on the basis of the 
similarity score, while the biological context of the borderline sequences 
may be helpful in distinguishing biologically interesting relationships. 
Recent results (Karlin & Altschul, 1990; Karlin et al., 1990) allow us to 
estimate the highest MSP score S at which chance similarities are likely to 
appear. To accelerate database searches, BLAST minimizes the time spent on 
sequence regions whose similarity with the query has little chance of 
exceeding this score. Let a word pair be a segment pair of fixed length w. 
The main strategy of BLAST is to seek only segment pairs that contain a word 
pair with a score of at least T. Scanning through a sequence, one can 
determine quickly whether it contains a word of length w that can pair with 
the query sequence to produce a word pair with a score greater than or equal 
to the threshold T. Any such hit is extended to determine if it is contained 
within a segment pair whose score is greater than or equal to S. The lower 
the threshold T, the greater the chance that a segment pair with a score of 
at least S will contain a word pair with a score of at least T. A small value 
for T, however, increases the number of hits and therefore the execution time 
of the algorithm. Random simulation permits us to select a threshold T that 
balances these considerations. 

In our implementations of this approach, details of the 3 algorithmic steps 
(namely compiling a list of high-scoring words, scanning the database for 
hits, and extending hits) vary somewhat depending on whether the database 
contains proteins or DNA sequences. For proteins, the list consists of all 
words (w-mers) that score at least T when compared to some word in the query 
sequence. Thus, a query word may be represented by no words in the list (e.g. 
for common w-mers using PAM-120 scores) or by many. (One may, of course, 
insist that every w-mer in the query sequence be included in the word list, 
irrespective of whether pairing the word with itself yields a score of at 
least T.) For values of w and T that we have found most useful (see below), 
there are typically of the order of 50 words in the list for every residue in 
the query sequence, e.g. 12,500 words for a sequence of length 250. If a 
little care is taken in programming, the list of words can be generated in 
time essentially proportional to the length of the list. 

The scanning phase raised a classic algorithmic problem, i.e. search a long 
sequence for all occurrences of certain short sequences. We investigated 2 
approaches. Simplified, the first works as follows. Suppose that w = 4 and 
map each word to an integer between 1 and 20^4, so a word can be used as an 
index into an array of size 20^4 = 160,000. Let the ith entry of such an 
array point to the list of all occurrences in the query sequence of the ith 
word. Thus, as we scan the database, each database word leads us immediately 
to the corresponding hits. Typically, only a few thousand of the 20^4 
possible words will be in this table, and it is easy to modify the approach 
to use far fewer than 20^4 pointers. 
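The first scanning approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the table here holds only the exact w-mers of the query (the protein word list described earlier would also contain neighbouring words scoring at least T), indices run from 0 to 20^4 - 1 rather than 1 to 20^4, and the names `word_index`, `build_table` and `scan` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
RESIDUE_CODE = {c: i for i, c in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Map a word to an integer by reading it as a base-20 number."""
    idx = 0
    for ch in word:
        idx = idx * 20 + RESIDUE_CODE[ch]
    return idx


def build_table(query, w=4):
    """Array of 20**w entries; entry i lists query positions of word i."""
    table = [[] for _ in range(20 ** w)]
    for pos in range(len(query) - w + 1):
        table[word_index(query[pos:pos + w])].append(pos)
    return table


def scan(table, subject, w=4):
    """Return (query_pos, subject_pos) for every word hit in the subject."""
    hits = []
    for spos in range(len(subject) - w + 1):
        for qpos in table[word_index(subject[spos:spos + w])]:
            hits.append((qpos, spos))
    return hits


if __name__ == "__main__":
    table = build_table("ACDEFGH")
    print(scan(table, "WWCDEFWW"))
```

Each database word costs one array lookup, which is the property the text relies on; a dictionary keyed by word index would be one way to use far fewer pointers.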

The second approach we explored for the scanning phase was the use of a 
deterministic finite automaton or finite state machine (Mealy, 1955; 
Hopcroft & Ullman, 1979). An important feature of our construction was to 
signal acceptance on transitions (Mealy paradigm) as opposed to on states 
(Moore paradigm). In the automaton's construction, this saved a factor in 
space and time roughly proportional to the size of the underlying alphabet. 
This method yielded a program that ran faster and we prefer this approach for 
general use. With typical query lengths and parameter settings, this version 
of BLAST scans a protein database at approximately 500,000 residues/s. 

Extending a hit to find a locally maximal segment pair containing that hit is 
straightforward. To economize time, we terminate the process of extending in 
one direction when we reach a segment pair whose score falls a certain 
distance below the best score found for shorter extensions. This introduces a 
further departure from the ideal of finding guaranteed MSPs, but the added 
inaccuracy is negligible, as can be demonstrated by both experiment and 
analysis (e.g. for protein comparisons the default distance is 20, and the 
probability of missing a higher scoring extension is about 0.001). 
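The termination rule for hit extension can be sketched in a few lines. The following is a hedged illustration rather than the authors' code: it extends in one direction only (the other direction is symmetric), takes the scoring function as a parameter, and uses the DNA scores from the text as a stand-in because the PAM-120 matrix is not reproduced here. The names `extend_right` and `dna_score`, and the choice to stop when the drop reaches (rather than exceeds) the distance, are assumptions.

```python
def dna_score(x, y):
    """Stand-in scoring function: +5 for an identity, -4 for a mismatch."""
    return 5 if x == y else -4


def extend_right(q, s, qpos, spos, score_fn=dna_score, dropoff=20):
    """Extend an ungapped segment pair rightwards from (qpos, spos).

    Stops at the end of either sequence, or once the running score has
    fallen `dropoff` or more below the best score seen so far.
    Returns (best_score, length_of_best_extension).
    """
    score = 0
    best = 0
    best_len = 0
    i = 0
    while qpos + i < len(q) and spos + i < len(s):
        score += score_fn(q[qpos + i], s[spos + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= dropoff:
            break
    return best, best_len


if __name__ == "__main__":
    print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0))
    print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, dropoff=4))
```

A small drop-off distance stops early and can miss a better extension beyond a mismatch, which is the trade-off between speed and accuracy that the paragraph describes.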

For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers 
in the query sequence, often with w = 12. Thus, a query sequence of length n 
yields a list of n - w + 1 words, and again there are commonly a few thousand 
words in the list. It is advantageous to compress the database by packing 4 
nucleotides into a single byte, using an auxiliary table to delimit the 
boundaries between adjacent sequences. Assuming w >= 11, each hit must 
contain an 8-mer hit that lies on a byte boundary. This observation allows us 
to scan the database byte-wise and thereby increase speed 4-fold. For each 
8-mer hit, we check for an enclosing w-mer hit; if found, we extend as 
before. Running on a SUN4, with a query of typical length (e.g. several 
thousand bases), BLAST scans at approximately 2 x 10^6 bases/s. At facilities 
which run many such searches a day, loading the compressed database into 
memory once in a shared memory scheme affords a substantial saving in 
subsequent search times. 
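Packing 4 nucleotides into a byte can be illustrated as below. This is a sketch under assumptions, not the database format used by the authors: the 2-bit codes (A=0, C=1, G=2, T=3), the padding of a final partial byte with A, and the function names `pack` and `unpack` are all chosen here for illustration. The sequence length is passed to `unpack` separately, playing the role of the auxiliary boundary table mentioned in the text.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, 4 bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")     # pad the last byte with A
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | BASE_CODE[ch]
        out.append(byte)
    return bytes(out)


def unpack(packed, n):
    """Recover the first n bases from a packed byte string."""
    out = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 3])
    return "".join(out[:n])


if __name__ == "__main__":
    packed = pack("ACGTAC")
    print(list(packed), unpack(packed, 6))
```

With this layout, two consecutive bytes hold exactly one byte-aligned 8-mer, which is why a byte-wise scan can test 8-mers directly.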

It should be noted that DNA sequences are highly non-random, with locally 
biased base composition (e.g. A+T-rich regions), and repeated sequence 
elements (e.g. Alu sequences) and this has important consequences for the 
design of a DNA database search tool. If a given query sequence has, for 
example, an A+T-rich subsequence, or a commonly occurring repetitive element, 
then a database search will produce a copious output of matches with little 
interest. We have designed a somewhat ad hoc but effective means of dealing 
with these 2 problems. The program that produces the compressed version of 
the DNA database tabulates the frequencies of all 8-tuples. Those occurring 
much more frequently than expected by chance (controllable by parameter) are 
stored and used to filter "uninformative" words from the query word list. 
Also, preceding full database searches, a search of a sublibrary of 
repetitive elements is performed, and the locations in the query of 
significant matches are stored. Words generated by these regions are removed 
from the query word list for the full search. Matches to the sublibrary, 
however, are reported in the final output. These 2 filters allow alignments 
to regions with biased composition, or to regions containing repetitive 
elements to be reported, as long as adjacent regions not containing such 
features share significant similarity to the query sequence. 

We implemented a version of BLAST that uses dynamic programming to extend 
hits so as to allow gaps in the resulting alignments. Needless to say, this 
greatly slows the extension process. While the sensitivity of amino acid 
searches was improved in some cases, the selectivity was reduced as well. 
Given the trade-off of speed and selectivity for sensitivity, it is 
questionable whether the gap version of BLAST constitutes an improvement. We 
also implemented the alternative of making a table of all occurrences of the 
w-mers in the database, then scanning the query sequence and processing hits. 
The disk space requirements are considerable, approximately 2 computer words 
for every residue in the database. More damaging was that for query sequences 
of typical length, the need for random access into the database (as opposed 
to sequential access) made the approach slower, on the computer systems we 
used, than scanning the entire database. 



3. Results 

To evaluate the utility of our method, we describe theoretical results about 
the statistical significance of MSP scores, study the accuracy of the 
algorithm for random sequences at approximating MSP scores, compare the 
performance of the approximation to the full calculation on a set of related 
protein sequences and, finally, demonstrate its performance comparing long 
DNA sequences. 

(a) Performance of BLAST with random sequences 

Theoretical results on the distribution of MSP scores from the comparison of 
random sequences have recently become available (Karlin & Altschul, 1990; 
Karlin et al., 1990). In brief, given a set of probabilities for the 
occurrence of individual residues, and a set of scores for aligning pairs of 
residues, the theory provides two parameters \lambda and K for evaluating the 
statistical significance of MSP scores. When two random sequences of lengths 
m and n are compared, the probability of finding a segment pair with a score 
greater than or equal to S is: 

    1 - e^{-y},                                      (1) 

where y = K m n e^{-\lambda S}. More generally, the probability of finding c 
or more distinct segment pairs, all with a score of at least S, is given by 
the formula: 

    1 - e^{-y} \sum_{i=0}^{c-1} \frac{y^i}{i!}.      (2) 

Using this formula, two sequences that share several distinct regions of 
similarity can sometimes be detected as significantly related, even when no 
segment pair is statistically significant in isolation. 
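Formulas (1) and (2) are straightforward to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the values of K and lambda used in the example are placeholders chosen for demonstration, not parameters taken from the paper or from any real scoring matrix.

```python
import math


def expected_y(S, m, n, K, lam):
    """y = K * m * n * exp(-lambda * S), as defined below formula (1)."""
    return K * m * n * math.exp(-lam * S)


def p_at_least_one(S, m, n, K, lam):
    """Formula (1): probability of a segment pair scoring at least S."""
    y = expected_y(S, m, n, K, lam)
    return -math.expm1(-y)          # equals 1 - exp(-y), accurate for small y


def p_at_least_c(S, m, n, K, lam, c):
    """Formula (2): probability of c or more distinct segment pairs scoring >= S."""
    y = expected_y(S, m, n, K, lam)
    partial = sum(y ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-y) * partial


if __name__ == "__main__":
    K, lam = 0.1, 0.3               # placeholder values, assumptions only
    for S in (30, 40, 50):
        print(S, p_at_least_one(S, 250, 250, K, lam),
              p_at_least_c(S, 250, 250, K, lam, c=2))
```

Setting c = 1 in formula (2) reduces the sum to its first term, 1, and recovers formula (1), which is a useful consistency check on the reconstruction.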
Figure 1. The probability q of BLAST missing a random maximal segment pair 
as a function of its score S. 



Whilo rinding tin MSl' with a /#-v,due uf <H)(H ">a\" 
b<> ?5urpvifl.mg when two speriHe Nt?qop)H'<r*r arc 
tromparcd, fit-arching a. daraUi,^ of HMKW Heipx-hee* 
for idmnarity (tj ;i <iuery iieqafiu'e is likely U» I urn 
up ten .well wgment pairs dimply by ehAiwc. 
Sttgmerti pair ;>v allies mu*t be dhjeuiuued aeiord- 
iaitfly wh<:n llK' sshnilrti' seamen Ln am dbu-uvcred 
Though blind database searches, I'sin^ formula (I). 

can eaU.ulate the approximate aeon* an M.ST J 
must have to be dial iiigiiixhiliJf '"rum c-h'ance 
#imf)arit.ie£ found in a dal&ba«e. 
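
As an illustration of formulas (1) and (2), the following Python sketch evaluates both probabilities and inverts formula (1) to obtain the score an MSP must reach for a chosen significance level. The function names and the numerical values of K and λ used in the usage note are hypothetical; the text does not give values for these parameters.

```python
import math


def p_at_least_one(K, lam, m, n, S):
    """Formula (1): probability of a segment pair with score >= S."""
    y = K * m * n * math.exp(-lam * S)
    return -math.expm1(-y)  # equals 1 - e^{-y}, computed stably for small y


def p_at_least_c(K, lam, m, n, S, c):
    """Formula (2): probability of c or more distinct segment pairs with score >= S."""
    y = K * m * n * math.exp(-lam * S)
    partial = sum(y ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-y) * partial


def score_cutoff(K, lam, m, n, p):
    """Score S at which formula (1) equals the probability p."""
    y = -math.log1p(-p)  # solve 1 - e^{-y} = p for y
    return math.log(K * m * n / y) / lam
```

For a database search, n can be taken as the total number of residues in the database, so that, for example, score_cutoff(0.1, 0.3, 250, 4_000_000, 0.05) returns the score needed for a 5% chance level under the assumed (illustrative) K = 0.1 and λ = 0.3.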

We are interested in finding only segment pairs
with a score above some cutoff S. The central idea of
the BLAST algorithm is to confine attention to
segment pairs that contain a word pair of length w
with a score of at least T. It is therefore of interest
to know what proportion of segment pairs with a
given score contain such a word pair. This question
makes sense only in the context of some distribution
of high-scoring segment pairs. For MSPs arising
from the comparison of random sequences, Dembo
& Karlin provide such a limiting distribution.

Theory does not yet exist to calculate the probability
q that such a segment pair will fail to contain
a word pair with a score of at least T. However, one
argument suggests that q should depend exponentially
upon the score of the MSP. Because the
frequencies of paired letters in MSPs approach a
limiting distribution, the
expected length of an MSP grows linearly with its
score. Therefore, the longer an MSP, the more independent
chances it effectively has for containing a
word with a score of at least T, implying that q
should decrease exponentially with increasing MSP
score.

To test this idea, we generated one million pairs
of "random protein sequences" (using typical amino
acid frequencies) of length 250, and found the MSP
for each using PAM-120 scores. In Figure 1, we plot
the logarithm of the fraction q of MSPs with score S
that do not contain a word pair of length four with
score at least 18. Since the values shown are subject
to statistical variation, error bars represent one
standard deviation. A regression line is plotted,
allowing for heteroscedasticity (differing degrees of
accuracy of the y-values). The correlation coefficient
for -ln(q) and S is 0.999, suggesting that for practical
purposes our model of the exponential dependence
of q upon S is valid.

We repeated this analysis for a variety of word
lengths and associated values of T. Table 1 shows
the regression parameters a and b found for each
instance; the correlation coefficient was always
greater than 0.995. Table 1 also shows the implied
percentage q = e^{-(aS+b)} of MSPs with various scores
that would be missed by the BLAST algorithm.
These numbers are of course properly applicable
only to chance MSPs. However, using a log-odds
score matrix such as the PAM-120 that is based
upon empirical studies of homologous proteins,
high-scoring chance MSPs should resemble MSPs
that reflect true homology (Karlin & Altschul,
1990). Therefore, Table 1 should provide a rough
guide to the performance of BLAST on homologous
as well as chance MSPs.
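
The regression model -ln(q) = aS + b described above can be sketched in Python as a weighted least-squares fit, with weights 1/σ² to allow for points of differing accuracy. This is only an illustration under assumptions: the function names are hypothetical, the weighting scheme is one common way to handle heteroscedastic data and is not stated in the text, and no values from Table 1 are reproduced.

```python
import math


def fit_weighted_line(xs, ys, sigmas):
    """Weighted least squares for y = a*x + b, weighting each point by 1/sigma^2."""
    ws = [1.0 / (s * s) for s in sigmas]
    sw = sum(ws)
    sx = sum(w * x for w, x in zip(ws, xs))
    sy = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    a = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    b = (sy - a * sx) / sw
    return a, b


def fraction_missed(a, b, S):
    """Implied fraction q = e^{-(aS+b)} of MSPs with score S that are missed."""
    return math.exp(-(a * S + b))
```

Here xs would hold MSP scores S, ys the observed values of -ln(q), and sigmas the standard deviations represented by the error bars.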

Based on the results of Karlin et al. (1990), Table
1 also shows the expected number of MSPs found
when searching a random database of 16,000 length
250 protein sequences with a length 250 query.
(These numbers were chosen to approximate the
current size of the PIR database and the length of
an average protein.) As seen from Table 1, only
MSPs with a score over 55 are likely to be
distinguishable from chance similarities. With w = 4
and T = 17, BLAST should miss only about a fifth
of the MSPs with this score, and only about a tenth
of MSPs with a score near 70. We will consider
below the algorithm's performance on real data.



(b) The choice of word length and
threshold parameters

On what basis do we choose the particular setting
of the parameters w and T for executing BLAST on
real data? We begin by considering the word
length w.

The time required to execute BLAST is the sum
of the times required (1) to compile a list of words
that can score at least T when compared with words
from the query; (2) to scan the database for hits (i.e.
matches to words on this list); and (3) to extend all
hits to seek segment pairs with scores exceeding the
cutoff. The time for the last of these tasks is proportional
to the number of hits, which clearly depends
on the parameters w and T. Given a random protein
model and a set of substitution scores, it is simple to
calculate the probability that two random words of
length w will have a score of at least T, i.e. the
probability of a hit arising from an arbitrary pair of
words in the query and the database. Using the
random model and scores of the previous section, we
have calculated these probabilities for a variety of
parameter choices and recorded them in Table 1.
For a given level of sensitivity (chance of missing an
MSP), one can ask what choice of w minimizes the
chance of a hit.

Table 1

The probability of a hit at various settings of the parameters w and T, and the
proportion of random MSPs missed by BLAST

Examining Table 1, it is apparent
that the parameter pairs (w = 3, T = 14), (w = 4,
T = 16) and (w = 5, T = 18) all have approximately
equivalent sensitivity over the relevant range of
cutoff scores. The probability of a hit yielded by
these parameter pairs is seen to decrease for
increasing w; the same also holds for different levels
of sensitivity. This makes intuitive sense, for the
longer the word pair examined the more information
gained about potential MSPs. Maintaining a
given level of sensitivity, we can therefore decrease
the time spent on step (3), above, by increasing the
parameter w. However, there are complementary
problems created by large w. For proteins there are
20^w possible words of length w, and for a given level
of sensitivity the number of words generated by a
query grows exponentially with w. (For example,
using the 3 parameter pairs above, a 30 residue
sequence was found to generate word lists of size
296, 3561 and 40,939 respectively.) This increases
the time spent on step (1), and the amount of
memory required. In practice, we have found that
for protein searches the best compromise between
these considerations is with a word size of four; this
is the parameter setting we use in all analyses that
follow.
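
The hit probability discussed above (the chance that two random words of length w score at least T) can be computed by convolving the score distribution of a single aligned pair w times. The Python sketch below shows the idea. It is an illustration only: the function names are hypothetical, and the toy four-letter alphabet with match score 5 and mismatch score -4 stands in for the amino acid frequencies and PAM-120 scores, which are not reproduced here.

```python
from collections import defaultdict


def pair_score_distribution(freqs, scores):
    """Distribution of the score of one aligned pair of random residues."""
    dist = defaultdict(float)
    for a, pa in freqs.items():
        for b, pb in freqs.items():
            dist[scores[a][b]] += pa * pb
    return dict(dist)


def hit_probability(freqs, scores, w, T):
    """Probability that two random words of length w have a score of at least T."""
    one = pair_score_distribution(freqs, scores)
    total = {0: 1.0}
    for _ in range(w):  # add one aligned position at a time
        nxt = defaultdict(float)
        for s1, p1 in total.items():
            for s2, p2 in one.items():
                nxt[s1 + s2] += p1 * p2
        total = dict(nxt)
    return sum(p for s, p in total.items() if s >= T)


# Toy model (assumption): uniform 4-letter alphabet, match 5, mismatch -4.
TOY_FREQS = {c: 0.25 for c in "ACGT"}
TOY_SCORES = {a: {b: (5 if a == b else -4) for b in "ACGT"} for a in "ACGT"}
```

With the toy model, hit_probability(TOY_FREQS, TOY_SCORES, 2, 10) is the chance that both positions match. For proteins, the number of possible words of length w is 20 ** w, which is what drives the memory cost of large w.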

Although reducing the threshold T improves the
approximation of MSP scores by BLAST, it
increases execution time because there will be more
words generated by the query sequence and therefore
more hits. What value of T provides a reasonable
compromise between the considerations of
sensitivity and time? To provide numerical data, we
compared a random 250 residue sequence against
the entire PIR database (Release 23.0, 14,372
entries and 3,977,903 residues) with T ranging from
20 to 13. In Figure 2 we plot the execution time
(user time on a SUN4-280) versus the number of
words generated for each value of T.

Figure 2. The central processing unit time required to
execute BLAST on the PIR protein database (Release
23.0) as a function of the size of the word list generated.
Points correspond to values of the threshold parameter T
ranging from 13 to 20. Greater values of T imply fewer
words in the list.

Table 2

The central processing unit time required to execute
BLAST as a function of the approximate probability
q of missing an MSP with score S

Times are for searching the PIR database (Release 23.0) with a
random query sequence of length 250 using a SUN4-280. CPU,
central processing unit.

Although there
is a linear relationship between the number of words
generated and execution time, the number of words
generated increases exponentially with decreasing T
over this range (as seen by the spacing of x values).
This plot and a simple analysis reveal that the
expected-time computational complexity of BLAST
is approximately aW + bN + cNW/20^w, where W is
the number of words generated, N is the number of
residues in the database and a, b and c are
constants. The W term accounts for compiling the
word list, the N term covers the database scan, and
the NW term is for extending the hits. Although the
number of words generated, W, increases exponentially
with decreasing T, it increases only linearly
with the length of the query, so that doubling the
query length doubles the number of words. We have
found in practice that T = 17 is a good choice for
the threshold because, as discussed below, lowering
the parameter further provides little improvement
in the detection of actual homologies.
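
The expected-time expression aW + bN + cNW/20^w can be written as a small Python function to see how each term responds to changes in W. The constants a, b and c depend on the machine and implementation; the text gives no values for them, so any numbers passed in are assumptions chosen purely for illustration.

```python
def expected_time(W, N, w, a, b, c):
    """Approximate BLAST run time: a*W + b*N + c*N*W / 20**w.

    W: number of words generated from the query
    N: number of residues in the database
    w: word length
    a, b, c: machine-dependent constants (hypothetical values in any example)
    """
    compile_words = a * W          # building the word list
    scan_database = b * N          # scanning the database for hits
    extend_hits = c * N * W / 20 ** w  # extending the hits that are found
    return compile_words + scan_database + extend_hits
```

Doubling the query length doubles W, so only the first and third terms grow; the database scan term b*N is unchanged.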

BLAST's direct tradeoff between accuracy and
speed is best illustrated by Table 2. Given a specific
probability q of missing a chance MSP with score S,
one can calculate what threshold parameter T is
required, and therefore the approximate execution
time. Combining the data of Table 1 and Figure 2,
Table 2 shows the central processing unit times
required (for various values of q and S) to search the
current PIR database with a random query
sequence of length 250. To have about a 10%
chance of missing an MSP with the statistically
significant score of 70 requires about nine seconds of
central processing unit time. To reduce the chance
of missing such an MSP to 2% involves lowering T,
thereby doubling the execution time. Table 2 illustrates,
furthermore, that the higher scoring (and
more statistically significant) an MSP, the less time
is required to find it with a given degree of
certainty.

(c) Performance of BLAST with
homologous sequences

To study the performance of BLAST on real data,
we compared a variety of proteins with other
members of their respective superfamilies (Dayhoff,
1978), computing the true scores as well as the
BLAST approximation with word length four and
various settings of the parameter T. Only with
superfamilies containing many distantly related
proteins could we obtain results usefully comparable
with the random model of the previous section.
Searching the globins with woolly monkey myoglobin
(PIR code MYMQW), we found 178
sequences containing MSPs with scores between 50
and 80. Using word length four and T parameter 17,
the random model suggests BLAST should miss
about 24 of these MSPs; in fact, it misses 43. This
poorer than expected performance is due to the
uniform pattern of conservation in the globins,
resulting in a relatively small number of high-scoring
words between distantly related proteins. A
contrary example was provided by comparing the
mouse immunoglobulin κ chain precursor V region
(PIR code KVMST1) with immunoglobulin
sequences, using the same parameters as previously.
Of the 33 MSPs with scores between 45 and 65,
BLAST missed only two; the random model
suggests it should have missed eight. In general, the
distribution of mutations along sequences has been
shown to be more clustered than predicted by a
Poisson process (Uzzell & Corbin, 1971), and thus
the BLAST approximation should, on average,
perform better on real sequences than predicted by
the random model.

BLAST's great utility is for finding high-scoring
MSPs quickly. In the examples above, the algorithm
found all but one of the 89 globin MSPs with
a score over 80, and all of the 125 immunoglobulin
MSPs with a score over 50. The overall performance
of BLAST depends upon the distribution of MSP
scores for those sequences related to the query. In
many instances, the bulk of the MSPs that are
distinguishable from chance have a high enough
score to be found readily by BLAST, even using
relatively high values of the T parameter. Table 3
shows the number of MSPs with a score above a
given threshold found by BLAST when searching a
variety of superfamilies using a variety of T parameters.
In each instance, the threshold S is chosen
to include scores in the borderline region, which in a
full database search would include chance similarities
as well as biologically significant relationships.
Even with T equal to 18, virtually all the statistically
significant MSPs are found in most instances.

Comparing BLAST (with parameters w = 4,
T = 17) to the widely used FASTP program
(Lipman & Pearson, 1985; Pearson & Lipman, 1988)
in its most sensitive mode (ktup = 1), we have found
that BLAST is of comparable sensitivity, generally
yields fewer false positives (high-scoring but unrelated
matches to the query), and is over an order of
magnitude faster.

(d) Comparison of two long DNA sequences

Sequence data exist for a 73,360 bp section of the
human genome containing the β-like globin gene
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Table 3 

The number of MSPs found by BLAST when searching various protein
superfamilies in the PIR database (Release 22.0)



query sequence



Superfamily
searched



Cutoff



Number of MSPs with score at least S
found by BLAST with T parameter set to



22



18



17



in superfamily
at least S



MYMQW
KVMST1

KYBOA



Globin

Immunoglobulin
Serine protease



47 

do 

4ft 
46 
44 



J 16 

.9 
12 

Si) 

$1 



Iti9 

ir,r 5 

L2 

no 
as 



178 
4? 



Efc2 
12 



I (HI 

12 

A3 

24 



2TWv 

a* 



3»l 

40 
12 
fit* 
OH 
24 



ttfl 
1t4 



I - 



MYMQW, woolly monkey myoglobin; KVMST1, mouse Ig κ chain precursor V region; OKBO2C, bovine cAMP-dependent protein
kinase; ITHU, human α-1-antitrypsin precursor; KYBOA, bovine chymotrypsinogen A; CCHU, human cytochrome c; FECF,



cluster and for a corresponding 44,594 bp section of
the rabbit genome (Margot et al., 1989). The pair
exhibits three main classes of locally similar regions,
namely genes, long interspersed repeats and certain
anticipated weaker similarities, as described below.
We used the BLAST algorithm to locate locally
similar regions that can be aligned without
introduction of gaps.

The human gene cluster contains six globin genes,
denoted ε, Gγ, Aγ, η, δ and β, while the rabbit cluster
has only four, namely ε, γ, δ and β. (Actually, rabbit
δ is a pseudogene.) Each of the 24 gene pairs, one
human gene and one rabbit gene, constitutes a
similar pair. An alignment of such a pair requires
insertions and deletions, since the three exons of one
gene generally differ somewhat in their lengths from
the corresponding exons of the paired gene, and
there are even more extensive variations among the
introns. Thus, a collection of the highest scoring
alignments between similar regions can be expected
to have at least 24 alignments between gene pairs.

Mammalian genomes contain large numbers of
long interspersed repeat sequences, abbreviated
LINES. In particular, the human β-like globin
cluster contains two overlapped L1 sequences (a
type of LINE) and the rabbit cluster has two
tandem L1 sequences in the same orientation, both
around 6000 bp in length. These human and rabbit
L1 sequences are quite similar and their lengths
make them highly visible in similarity computations.
In all, eight L1 sequences have been cited in
the human cluster and five in the rabbit cluster, but
because of their reduced length and/or reversed
orientation, the other published L1 sequences do
not affect the results discussed below. Very recently,
another piece of an L1 sequence has been discovered
in the rabbit cluster (Huang et al., 1990).

Evolution theory suggests that an ancestral gene
cluster arranged as 5'-ε-γ-η-δ-β-3' may have existed
before the mammalian radiation. Consistent with
this hypothesis, there are inter-gene similarities
within the β clusters. For example, there is a region
between human ε and Gγ that is similar to a region
between rabbit ε and γ.

We applied a variant of the BLAST program to
these two sequences, with match score 5, mismatch
score -4 and, initially, w = 12. The program found
98 alignments scoring over 200, with 1301 being the
highest score. Of the 57 alignments scoring over 350,
45 paired genes (with each of the 24 possible gene
pairs represented) and the remaining 12 involved L1
sequences. Below 350, inter-gene similarities (as
described above) appear, along with additional
alignments of genes and of L1 sequences. Two alignments
with scores between 200 and 350 do not fit
the anticipated pattern. One reveals the newly discovered
section of L1 sequence. The other aligns a
region immediately 5' from the human β gene with a
region just 5' from rabbit δ. This last alignment
may be the result of an intrachromosomal gene
conversion between δ and β in the rabbit genome
(Hardison & Margot, 1984).

With smaller values of w, more alignments are
found. In particular, with w = 8, an additional 32
alignments are found with a score above 200. All of
these fall in one of the three classes discussed above.
Thus, use of a smaller w provides no essentially new
information. The dependence of various values on w
is given in Table 4. Time is measured in seconds on
a SUN4 for a simple variant of BLAST that works
with uncompressed DNA sequences.
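
A minimal Python sketch of a word-based, ungapped DNA comparison of the kind described above is given below, using the match score 5 and mismatch score -4 quoted in the text. Several details are assumptions made for illustration and are not taken from the text: hits are taken to be exact word matches of length w, extension in each direction stops once the running score falls more than a hypothetical `drop` below the best score seen, and all function names are invented.

```python
from collections import defaultdict

MATCH, MISMATCH = 5, -4  # scores quoted in the text


def index_words(seq, w):
    """Map every length-w word of seq to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - w + 1):
        table[seq[i:i + w]].append(i)
    return table


def extend(q, d, qi, di, w, drop):
    """Extend an exact word hit in both directions without gaps.

    Returns (best score, query start, query end) of the best segment found.
    """
    best = score = w * MATCH
    best_right = k = 0
    while qi + w + k < len(q) and di + w + k < len(d):  # extend to the right
        score += MATCH if q[qi + w + k] == d[di + w + k] else MISMATCH
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > drop:
            break
    score = best
    best_left = k = 0
    while qi - k - 1 >= 0 and di - k - 1 >= 0:  # extend to the left
        score += MATCH if q[qi - k - 1] == d[di - k - 1] else MISMATCH
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > drop:
            break
    return best, qi - best_left, qi + w + best_right


def find_segment_pairs(q, d, w, cutoff, drop=20):
    """Return {(query start, database start): (score, length)} for scores >= cutoff."""
    table = index_words(q, w)
    found = {}
    for di in range(len(d) - w + 1):
        for qi in table.get(d[di:di + w], ()):
            score, qs, qe = extend(q, d, qi, di, w, drop)
            if score >= cutoff:
                found[(qs, qs - qi + di)] = (score, qe - qs)
    return found
```

Smaller values of w produce more word hits and therefore more extensions, which is the time and sensitivity tradeoff summarized in Table 4.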



Table 4 

The time and sensitivity of BLAST on
DNA sequences as a function of w
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4. Conclusion 

The concept underlying BLAST is simple and
robust and therefore can be implemented in a
number of ways and utilized in a variety of
contexts. As mentioned above, one variation is to
allow for gaps in the extension step. For the applications
we have had in mind, the tradeoff in speed
proved unacceptable, but this may not be true for
other applications. We have implemented a shared
memory version of BLAST that loads the
compressed DNA file into memory once, allowing
subsequent searches to skip this step. We are implementing
a similar algorithm for comparing a DNA
sequence to the protein database, allowing translation
in all six reading frames. This permits the
detection of distant protein homologies even in the
face of common DNA sequencing errors (replacements
and frame shifts). C. B. Lawrence (personal
communication) has fashioned score matrices
derived from consensus pattern matching methods
(Smith & Smith, 1990), and different from the
PAM-120 matrix used here, which can greatly
decrease the time of database searches for sequence
motifs.

The BLAST approach permits the construction of
extremely fast programs for database searching that
have the further advantage of amenability to
mathematical analysis. Variations of the basic idea
as well as alternative implementations, such as
those described above, can adapt the method for
different contexts. Given the increasing size of
sequence databases, BLAST can be a valuable tool
for the molecular biologist. A version of BLAST in
the C programming language is available from the
authors upon request (write to W. Gish); it runs
under both 4.2 BSD and the AT&T System V
UNIX operating systems.

W.M. is supported in part by NIH grant LM05110, and
E.W.M. is supported in part by NIH grant LM04960.
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Information Theoretic Perspective 
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EXHIBIT 



Stephen F. Altschul 

National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, MD 20894, U.S.A.

(Received 1 October 1990; accepted 12 February 1991)

Protein sequence alignments have become an important tool for molecular biologists. Local
alignments are frequently constructed with the aid of a "substitution score matrix" that
specifies a score for aligning each pair of amino acid residues. Over the years, many different
substitution matrices have been proposed, based on a wide variety of rationales. Statistical
results, however, demonstrate that any such matrix is implicitly a "log-odds" matrix, with
a specific target distribution for aligned pairs of amino acid residues. In the light of
information theory, it is possible to express the scores of a substitution matrix in bits and to
see that different matrices are better adapted to different purposes. The most widely used
matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for
database searches the PAM-120 matrix generally is more appropriate, while for comparing
two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples
discussed include the lipocalins, human α1B-glycoprotein, the cystic fibrosis transmembrane
conductance regulator and the globins.

Keywords: homology; sequence comparison; statistical significance; alignment algorithms;
pattern recognition




1. Introduction 

General methods for protein sequence comparison
were introduced to molecular biology 20 years ago
and have since gained widespread use. Most early
attempts to measure protein sequence similarity
focused on global sequence alignments, in which
every residue of the two sequences compared had to
participate (Needleman & Wunsch, 1970; Sellers,
1974; Sankoff & Kruskal, 1983). However, because
distantly related proteins may share only isolated
regions of similarity, e.g. in the vicinity of an active
site, attention has shifted to local as opposed to
global sequence similarity measures. The basic idea
is to consider only relatively conserved subsequences;
dissimilar regions do not contribute to or
subtract from the measure of similarity. Local similarity
may be studied in a variety of ways. These
include measures based on the longest matching
segments of two sequences with a specified number
or proportion of mismatches (Arratia et al., 1986;
Arratia & Waterman, 1989), as well as methods that
compare all segments of a fixed, predefined
window length (McLachlan, 1971). The most
common practice, however, is to consider segments
of all lengths, and choose those that optimize a
similarity measure (Smith & Waterman, 1981; Goad
& Kanehisa, 1982; Sellers, 1984). This has the
advantage of placing no a priori restrictions on the
length of the local alignments sought. Most database
search methods have been based on such local
alignments (Lipman & Pearson, 1985; Pearson &
Lipman, 1988; Altschul et al., 1990).

0022-2836/91/110555-11 $03.00/0

To evaluate local alignments, scores generally are
assigned to each aligned pair of residues (the set of
such scores is called a substitution matrix), as well as
to residues aligned with nulls; the score of the
overall alignment is then taken to be the sum of
these scores. Specifying an appropriate amino acid
substitution matrix is central to protein comparison
methods and much effort has been devoted to
defining, analyzing and refining such matrices
(McLachlan, 1971; Dayhoff et al., 1978; Schwartz &
Dayhoff, 1978; Feng et al., 1985; Rao, 1987; Risler et
al., 1988). One hope has been to find a matrix best
adapted to distinguishing distant evolutionary
relationships from chance similarities. Recent
mathematical results (Karlin & Altschul, 1990;
Karlin et al., 1990) allow all substitution matrices to
be viewed in a common light, and provide a
rationale for selecting particular sets of "optimal"
scores for local protein sequence comparison.

© 1991 Academic Press Limited






2. The Statistical Significance of Local
Sequence Alignments

Global alignments are of essentially no use unless
they can allow gaps, but this is not true for local
alignments. The ability to choose segments with
arbitrary starting positions in each sequence means
that biologically significant regions frequently may
be aligned without the need to introduce gaps.
While, in general, it is desirable to allow gaps in
local alignments, doing so greatly decreases their
mathematical tractability. The results described
here apply rigorously only to local alignments that
lack gaps, i.e. to segments of equal length from each
of the two sequences compared. Some recent database
search tools have focused on finding such alignments
(Altschul & Lipman, 1990; Altschul et al.,
1990). However, the statistics of optimal scores for
local alignments that include gaps (Smith et al.,
1985; Waterman et al., 1987) are broadly analogous
to those for the no-gap case (Karlin & Altschul,
1990; Karlin et al., 1990), where more precise results
are available. Therefore, one may hope that many of
the basic ideas presented below will generalize to
local alignments that include gaps.

Formally, we assume that the aligned amino acids
a_i and a_j are assigned the substitution score s_ij.
Given two protein sequences, the pair of equal
length segments that, when aligned, have the
greatest aggregate score we call the Maximal
Segment Pair (MSP†). An MSP may be of any
length; its score is the MSP score.

Since any two protein sequences, related or unrelated,
will have some MSP score, it is important to
know how great a score one can expect to find
simply by chance. To address this question one
needs some model of chance. The simplest is to
assume that in the two proteins compared, the
amino acid a_i appears randomly with the probability
p_i. These probabilities are chosen to reflect
the observed frequencies of the amino acids in
actual proteins. For simplicity of discussion we will
assume both proteins share the same amino acid
probability distribution; more generally, one can
allow them to have different distributions. A
random protein sequence is simply one constructed
according to this model.

For the sake of the statistical theory, we need to
make two crucial but reasonable assumptions about
the substitution scores. The first is that there be at
least one positive score and the second is that the
expected score $\sum_{i,j} p_i p_j s_{ij}$ be negative. Because we
permit the length of a segment pair to be adjusted
to optimize its score, both these assumptions are
necessary also from a practical perspective. If there
were no positive scores, the MSP would always
consist of a single pair of residues (or none at all, if
this were permitted), and such an alignment is not
of interest. If the expected score for two random
residues were positive, extending a segment pair as
far as possible would always tend to increase its
score; this violates the idea of seeking local alignments.
Substitution matrices used in other contexts,
such as global alignments (Needleman & Wunsch,
1970) or local alignments using windows
(McLachlan, 1971), need not satisfy these
constraints. However, unless otherwise stated, it
will be assumed below that any substitution matrix
satisfies the two conditions described.

† Abbreviations used: MSP, Maximal Segment Pair;
Ig, immunoglobulin.

The statistical theory of MSP scores (Karlin &
Altschul, 1990; Karlin et al., 1990) involves a key
parameter λ, which is the unique positive solution to
the equation:

$$\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1. \qquad (1)$$

Notice that multiplying all the scores of a substitution
matrix by some positive constant does not
affect the relative scores of any subalignments. Two
matrices related by such a factor can, therefore, be
considered essentially equivalent. Inspection of
equation (1) reveals that multiplying all scores by a
also has the effect of dividing λ by a. The parameter
λ may, therefore, be viewed as a natural scale for
any scoring system; its deeper meaning will be
discussed below.
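
As a sketch of how λ might be obtained in practice, the following Python function solves equation (1) by bisection after checking the two conditions on the scores stated earlier. The function name and the choice of bisection are assumptions made for illustration; the text does not describe a numerical method.

```python
import math


def solve_lambda(freqs, scores, tol=1e-12):
    """Unique positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, found by bisection."""
    pairs = [(freqs[a] * freqs[b], scores[a][b]) for a in freqs for b in freqs]
    if not any(s > 0 for _, s in pairs):
        raise ValueError("at least one score must be positive")
    if sum(p * s for p, s in pairs) >= 0:
        raise ValueError("the expected score must be negative")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # f is negative just above 0 and eventually positive
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Multiplying every score by a constant and solving again should return λ divided by that constant, which is the scaling property noted above.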

Given two random protein sequences as described
above, how many distinct, or "locally optimal"
(Sellers, 1984) MSPs with score at least S are
expected to occur simply by chance? This number is
well approximated by the formula:

$$K N e^{-\lambda S}, \qquad (2)$$

where N is the product of the sequences' lengths,
and K is an explicitly calculable parameter (Karlin
& Altschul, 1990; Karlin et al., 1990). When
comparing a single random sequence with all the
sequences in a database, setting N to the product of
the query sequence length and the database length
(in residues) yields an upper bound on the number
of distinct MSPs with score at least S that the
search is expected to yield.



3. Optimal Substitution Matrices for Local
Sequence Alignment

Formula (2) allows us to tell when a segment pair
has a significantly high score. However, it does not
assist in choosing an appropriate substitution
matrix in the first place. A second class of results,
however, has direct bearing on this question. These
state that among MSPs from the comparison of
random sequences, the amino acids a_i and a_j are
aligned with frequency approaching $q_{ij} = p_i p_j e^{\lambda s_{ij}}$
(Arratia et al., 1988; Karlin & Altschul, 1990; Karlin
et al., 1990; Dembo & Karlin, 1991).

Given any substitution matrix and random protein
model, one may easily calculate the set of
target frequencies, q_ij, just described. Notice that
by the definition of λ in equation (1), these target
frequencies sum to 1. Now among alignments representing
distant homologies, the amino acids are
paired with certain characteristic frequencies. Only
if these correspond to a matrix's target frequencies,
it has been argued, can the matrix be optimal for
distinguishing distant local homologies from similarities
due to chance (Karlin & Altschul, 1990).

Any substitution matrix has an implicit set of
target frequencies for aligned amino acids. Writing
the scores of the matrix in terms of its target
frequencies, one has:

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}. \qquad (3)$$

In other words, the score for an amino acid pair can
be written as the logarithm to some base of that
pair's target frequency divided by the background
frequency with which the pair occurs. Such a ratio
compares the probability of an event occurring
under two alternative hypotheses and is called a
likelihood or odds ratio. Scores that are the
logarithm of odds ratios are called log-odds scores.
Adding such scores can be thought of as multiplying
the corresponding probabilities, which is appropriate
for independent events, so that the total score
remains a log-odds score.
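
The relationship between scores and target frequencies can be made concrete with a short Python sketch that converts in both directions. The function names are hypothetical, and the dictionaries keyed by residue pairs are simply one convenient representation chosen for illustration.

```python
import math


def target_frequencies(freqs, scores, lam):
    """q_ij = p_i * p_j * exp(lam * s_ij) for every residue pair (i, j)."""
    return {(a, b): freqs[a] * freqs[b] * math.exp(lam * scores[(a, b)])
            for a in freqs for b in freqs}


def log_odds_scores(freqs, targets, lam):
    """s_ij = ln(q_ij / (p_i * p_j)) / lam, the log-odds form of the scores."""
    return {(a, b): math.log(q / (freqs[a] * freqs[b])) / lam
            for (a, b), q in targets.items()}
```

Choosing lam = math.log(2) in log_odds_scores expresses the scores as base-2 logarithms. Converting a set of target frequencies to scores and back with the same lam should return the original frequencies.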

Log-odds matrices have been advocated in a
number of contexts (Dayhoff et al., 1978; Gribskov
et al., 1987; Stormo & Hartzell, 1989). The widely
used PAM matrices (Dayhoff et al., 1978), for
instance, are explicitly of this form. Other substitution
matrices, though based on a wide variety of
rationales, are all log-odds matrices, but with
implicit rather than explicit target frequencies.
Therefore, while one may criticize the method
described by Dayhoff et al. for estimating appropriate
target frequencies (Wilbur, 1985), the most
direct way to derive superior matrices appears to be
through the refined estimation of amino acid pair
target and background frequencies rather than
through any fundamentally different approach.



4. Substitution Matrices for Global Alignments

While we have been considering substitution 
matrices in the context of local sequence compari- 
son, they may be employed for global alignment as 
wed (Xeedleman A Wunach P 1070; Sellers, 1974: 
Schwartz Dayhoff, 1978). Thar* is a fundamental 
difference, however, between the use of such 
matrices in these two contexts. For global align- 
ments* ok previously, multiplying alt scotch by a 
fixed positive number has no effect on the relative 
scores of different alignments. But adding a fixed 
quantity a to the score for aligning any pair of 
residues (and a/2 to the score for aligning a residue 
with a hull) likewioe has no effect. Scoring systems 
that may be transformed into one another by means 
of these two rules art, for ail practical purposes, 
equivalent. Unfortunately, the new transformation 
means that no unique log^odda interpretation of 
global substitution matrices ib poa&ible, and it is 



doubtful that any "target distribution" thwwm 
can be proved, ft may be possible to make a 
convincing case for a particular substitution matrix 
in the global alignment context, but the argument 
will most likely have to b* different from that for 
focal alignments (Karl in & Altschul, 1990). The 
same applies to substitution, matrices uaed with 
facd-length windows for studying local rimilaritiea 
(McLachlan. 1971; Argos. 1087; Stormo & Hacirall, 
1989): a fixed quantity can be added to all entries of 
such a matrix with no essential effect. It 13 notable 
that while the PAM matrices were developed origin* 
ally for global sequence comparison (Dayhoff et a/., 
1978). their statistical theory hoa blossomed in the 
local alignment context. 
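The invariance under adding a to pair scores and a/2 to null scores can be
checked numerically. The following is a minimal sketch, assuming a simple
match/mismatch scoring function and a linear per-residue gap score; the
sequences, the scores and the function names are hypothetical. Every global
alignment of sequences of lengths m and n with k aligned pairs and g nulls
satisfies 2k + g = m + n, so every alignment score shifts by the same
amount, a(m + n)/2, and the ranking of alignments is unchanged.

```python
def global_score(x, y, pair_score, gap):
    """Optimal global alignment score (Needleman-Wunsch recurrence)
    with a fixed score `gap` for aligning a residue with a null."""
    m, n = len(x), len(y)
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * gap] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = max(prev[j - 1] + pair_score(x[i - 1], y[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
        prev = cur
    return prev[n]

def match_mismatch(r1, r2):
    # Hypothetical toy scores, not a real substitution matrix.
    return 2.0 if r1 == r2 else -1.0

x, y = "HEAGAWGHEE", "PAWHEAE"
a = 3.0
base = global_score(x, y, match_mismatch, gap=-2.0)
shifted = global_score(x, y,
                       lambda r1, r2: match_mismatch(r1, r2) + a,
                       gap=-2.0 + a / 2)
shift = shifted - base  # the same constant for every global alignment
```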



5. Local Alignment Scores as Measures
of Information

Multiplying a substitution matrix by a constant
changes λ but does not alter the matrix's implicit
target frequencies. By appropriate scaling, one may
therefore select the parameter λ at will. Writing the
matrix in log-odds form, such scaling corresponds
merely to using a different implicit base for the
logarithm. One natural choice for λ is 1, so that all
scores become natural logarithms. Perhaps more
appealing is to choose λ = ln 2 ≈ 0.693, so that the
base for the log-odds matrix becomes 2. This lends a
particularly intuitive appeal to formula (2). Setting
the expected number of MSPs with score at least S
equal to p, and solving for S, one finds:

$$ S = \log_2 \frac{K}{p} + \log_2 N \qquad (4) $$

For typical substitution matrices, K is found to be
near 0.1, and an alignment may be considered
significant when p is 0.05. Therefore the right-hand
side of equation (4) generally is dominated by the
term $\log_2 N$. In other words, the score needed to
distinguish an MSP from chance is approximately
the number of bits needed to specify where the MSP
starts in each of the two sequences being compared.
(One bit can be thought of as the answer to a single
yes-no question; it is the amount of information
needed to distinguish between 2 possibilities. It
becomes apparent that, in general, $\log_2 N$ bits of
information are needed to distinguish among N
possibilities.)
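Equation (4) is easily evaluated. The sketch below assumes K = 0.1 and
p = 0.05 as quoted above, and takes N to be the product of the lengths of
the two sequences (or of the query and the database); the function name is
hypothetical. The search-space sizes are the two examples considered in
this section: two proteins of 250 residues, and one such protein against a
database of 4,000,000 residues.

```python
import math

def score_threshold_bits(n_search_space, k=0.1, p=0.05):
    """Score S (in bits) at which p MSPs are expected by chance,
    following S = log2(K/p) + log2(N)."""
    return math.log2(k / p) + math.log2(n_search_space)

two_proteins = score_threshold_bits(250 * 250)
database = score_threshold_bits(250 * 4_000_000)

# The dominant term log2(N) alone:
log2_n_pair = math.log2(250 * 250)
log2_n_database = math.log2(250 * 4_000_000)
```

With these values the term log2(K/p) contributes exactly one bit, so the
thresholds are governed almost entirely by the size of the search space.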

For comparing two proteins of length 250 amino
acid residues, about 16 bits of information are
required; for comparing one such protein to a
sequence database containing 4,000,000 residues,
about 30 bits are needed. When cast in this light,
alignment scores are not arbitrary numbers. By
appropriate scaling (multiplying by λ/0.693) they
take on the units of bits, and rough significance
calculations can be performed in one's head.
Furthermore, when so normalized, different amino
acid substitution matrices may be directly
compared.



6. The Relative Entropy of a
Substitution Matrix

The above review of previous results has provided
us with the necessary tools for the analysis that
follows. The ultimate goal is to decide which
substitution matrices are the most appropriate for
database searching and for detailed pairwise sequence
comparison.

Given a random protein model and a substitution
matrix, one may calculate the target frequencies $q_{ij}$
characteristic of the alignments for which the
matrix is optimized. A useful quantity to consider is
the average score (information) per residue pair in
these alignments. Assuming the substitution matrix
is normalized as described above, this value is
simply:

$$ H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j} $$

Notice that H depends both on the substitution
matrix and on the random protein model. In
information theoretic terms, H is the relative
entropy of the target and background distributions.
The origin of the name need not be of concern. The
important point is that, for an alignment
characterized by the target frequencies $q_{ij}$, H measures the
average information available per position to
distinguish the alignment from chance. Intuitively,
the higher the value of the relative entropy of target
and background distributions, the more easily they
are distinguished. For a high value of H, relatively
short alignments with the target distribution can be
distinguished from chance, while, if the value of H is
lower, longer alignments are necessary.
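The relative entropy can be computed directly from the target and
background frequencies. The following minimal sketch uses a hypothetical
two-letter alphabet and two invented target distributions, one strongly and
one weakly conserved; none of these numbers comes from a PAM matrix.

```python
import math

def relative_entropy_bits(q, p):
    """H = sum over pairs of q_ij * log2(q_ij / (p_i * p_j))."""
    return sum(q_ij * math.log2(q_ij / (p[i] * p[j]))
               for (i, j), q_ij in q.items() if q_ij > 0)

# Hypothetical frequencies, for illustration only.
p = {"A": 0.5, "B": 0.5}
strong = {("A", "A"): 0.45, ("A", "B"): 0.05,
          ("B", "A"): 0.05, ("B", "B"): 0.45}
weak = {("A", "A"): 0.30, ("A", "B"): 0.20,
        ("B", "A"): 0.20, ("B", "B"): 0.30}
chance = {(i, j): p[i] * p[j] for i in p for j in p}

h_strong = relative_entropy_bits(strong, p)
h_weak = relative_entropy_bits(weak, p)
h_chance = relative_entropy_bits(chance, p)
```

The strongly conserved distribution carries more information per position
than the weak one, and a target distribution identical to the background
carries none.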

It is interesting to examine the PAM model of
molecular evolution (Dayhoff et al., 1978) from this
standpoint. From a study of mutations between a
large number of closely related proteins, Dayhoff
and co-workers proposed a stochastic model of
protein evolution. The amount of evolutionary change
that yields, on average, one substitution in 100
amino acid residues they called one PAM. Using
their model, one may easily calculate the frequency
with which any two amino acid residues are paired
in an accurate alignment of two homologous
proteins that have diverged by any given amount of
evolutionary change. These target frequencies may
then be used to construct log-odds matrices and, in
particular, the widely used PAM-250 matrix.
Dayhoff et al. (1978) originally proposed this matrix
for the global alignment of two sequences suspected
to be homologous, but it has since been used to
search protein databases for local alignments to a
query sequence (Lipman & Pearson, 1985; Pearson
& Lipman, 1988). One may therefore inquire
whether 250 PAMs yield reasonable target
frequencies for database searches.

Assuming the model described by Dayhoff et al.
(1978), Table 1 lists the relative entropy H implicit
in a range of PAM matrices. As argued above,
distinguishing an alignment from chance in a search
of a typical current protein database using an
average length protein requires about 30 bits of
information. Accordingly, for an alignment of
segments separated by a given PAM distance, one
can calculate the minimum length necessary to rise
above background noise; these lengths are recorded
in Table 1. For instance, at a distance of 250 PAMs,
on average only 0.36 bit of information is available
per alignment position. To be statistically
significant, such an alignment would need to have a
length greater than about 83 residues. Many
biologically interesting regions of protein similarity are
much shorter than this, and accordingly need a
stronger signal to be detected. A local alignment of
length 20 residues will need about 1.5 bits per
alignment position, while one of length 40 residues will
need about 0.75 bit. Table 1 shows that such
alignments will not be detectable if their constituent
segments have diverged by more than about 75 and
150 PAMs, respectively.
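The arithmetic behind these critical lengths is a simple division, sketched
below under the assumption stated in the text that about 30 bits are needed
in a typical database search. The function names are hypothetical.

```python
def critical_length(bits_per_position, bits_needed=30.0):
    """Minimum alignment length that accumulates `bits_needed` bits."""
    return bits_needed / bits_per_position

def required_bits_per_position(length, bits_needed=30.0):
    """Average information per position that an alignment of the
    given length must carry to reach `bits_needed` bits."""
    return bits_needed / length

length_at_pam250 = critical_length(0.36)       # relative entropy at 250 PAMs
bits_for_20 = required_bits_per_position(20)   # short, strong alignment
bits_for_40 = required_bits_per_position(40)
```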

7. PAM Matrices for Database Searching and
Two-sequence Comparison

The relative entropy associated with a specific
PAM distance indicates how much information per
position is optimally available. For a given
alignment, one can attain such a score only by using the
appropriate PAM matrix, but, of course, before the
alignment is found it will not be known which
matrix that is. It has therefore been proposed that a
variety of PAM matrices be used for database
searches (Collins et al., 1988). We seek here to
analyze how many such matrices are necessary, and
which should be used.

Suppose one uses a matrix optimized for PAM
distance M to compare two homologous protein
segments that are actually separated by PAM
distance D. For a range of values of M and D, the
average score achieved per alignment position is
shown in Table 2. Notice that for any given matrix
M, the smaller the actual distance D, the higher the
score. On the other hand, for a specific distance D,
the highest score corresponds to the matrix with
PAM distance M = D; this score is just the relative
entropy discussed above. Using a PAM matrix with
M near D, however, can yield a near-optimal score.
For example, the relative entropy for D = 160 is
0.70 bit, but any PAM matrix in the range 120 to
200 yields at least 0.67 bit per position. In practice,
how near the optimal is it important to be?
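The effect of using a mismatched matrix can be sketched with a toy model.
The one-parameter family of 2 x 2 target distributions below is purely
hypothetical (it is not the PAM model): the parameter t plays the role of
similarity, so a larger t corresponds to a smaller evolutionary distance.
The average score per position is the sum of the actual target frequencies
times the scores of the matrix in use.

```python
import math

def target(t):
    """Hypothetical 2x2 target distribution with uniform marginals;
    0 < t < 1, and larger t means more conserved pairs."""
    same, diff = (1 + t) / 4, (1 - t) / 4
    return [[same, diff], [diff, same]]

def expected_score_bits(t_actual, t_matrix, background=0.25):
    """Average score per position when alignments with target
    frequencies target(t_actual) are scored with the log-odds matrix
    built from target(t_matrix); background is p_i * p_j."""
    q = target(t_actual)
    q_m = target(t_matrix)
    return sum(q[i][j] * math.log2(q_m[i][j] / background)
               for i in range(2) for j in range(2))

matched = expected_score_bits(0.5, 0.5)     # equals the relative entropy
too_weak = expected_score_bits(0.5, 0.3)    # matrix tuned for weaker similarity
too_strong = expected_score_bits(0.5, 0.7)  # matrix tuned for stronger similarity
closer = expected_score_bits(0.8, 0.5)      # same matrix, more similar segments
```

In this toy model, as in Table 2, the matched matrix gives the highest
average score for a fixed actual distribution, and for a fixed matrix more
similar segments score higher.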

As argued above, for a given PAM distance there
is a critical length at which alignments are just
distinguishable from chance in a typical current
database search; these lengths are recorded in Table
1. For the sake of analysis, we will assume that it is
worth performing an extra search (using a different
PAM matrix) only if it is able to increase the score
for such a critical alignment by about two bits,
corresponding to a factor of 4 in significance. Since a
critical alignment has about 30 bits of information,
we will therefore be satisfied using a PAM matrix
that yields a score greater than 93% of the optimal
achievable. Using data such as those shown in Table
2, one can calculate for which PAM distances D
(and thus for which critical lengths) a given matrix
M is appropriate; the results are recorded in Table
3. Our experience has shown that perhaps the most
typical lengths for distant local alignments are those
for which the PAM-120 matrix gives near-optimal
scores, i.e. lengths 19 to 50 residues. Therefore, if
one wishes to use a single standard matrix for
database searches, the PAM-120 matrix (Table 4) is
a reasonable choice. This matrix may, however,
miss short but strong or long but weak similarities
that contain sufficient information to be found.
Accordingly, Table 3 shows that to complement the
PAM-120 matrix, the PAM-40 and PAM-240 (or
traditional PAM-250) matrices can be used.
Additional matrices should improve the detection of
distant similarities only marginally (i.e. raise their
scores by at most two bits).
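The adequacy thresholds follow from the two-bit criterion by simple
arithmetic, sketched here with a hypothetical helper function. The 30-bit
figure applies to database searches, and the 16-bit figure to the
comparison of two proteins of typical length.

```python
def adequate_fraction(bits_needed, tolerated_loss_bits=2.0):
    """Fraction of the optimal score a matrix must reach so that a
    critical alignment loses no more than `tolerated_loss_bits`."""
    return (bits_needed - tolerated_loss_bits) / bits_needed

database_search = adequate_fraction(30.0)   # 28/30, about 93%
two_sequences = adequate_fraction(16.0)     # 14/16, quoted as 87% in the text
significance_factor = 2 ** 2.0              # two bits: a factor of 4
```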

If, rather than searching a database with a query
sequence, one wishes to compare two specific
sequences for which one already has evidence of
relatedness, the background noise is greatly
decreased. As discussed above, for two proteins of
typical length, about 16 bits are needed to
distinguish a local alignment from chance.
Accordingly, applying the same criteria as before, a
matrix should be considered adequate for those
PAM distances at which it yields an average score
within 87% of the optimal. In Table 3, we list the
range of critical lengths over which various PAM
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of 2 bcl W «, a " d . l5,nce a P'obabillty factor 

units for 

ipjh 10, 3/(ln ,7 

thai, matrix can be thenwhi «r 0 0It SCOre m 
one-Lhird of a bit thou « hfc ° r « approximately 



8. Biological Examples



And PAM-120 scores for MSPs renrpcontir,^ w , . 
re talloi-hfp, u> four different Z^"^^" 1 
ail cases, we cDneirfor ^i^t- , ^ ^ wquencee. Jti 

sJn« neither the MR databo^ n« ^ " rihermor *, 
Requenoe ever precisely fil^Th- 5 / <JUer - V 

meter A varies .lirttli- r«L? ( 8,1 the pafa - 
scores from Table a iT.if . he Pj4M-12o 

*o-d be U ATftt^jJ 

approxifrtatJon. a sl 1 ght 



(a) Lipocalins

isolated slr^he.Tf ™|atrv^ Pr ° te "i S • h « '»7S and PAM fao /T n ! ' AM ; 25 ° C^vholT«V. 

Bb'wpnun thai belong T ihVTTT 

wv^r*l menl |^- „ r ,f lruaur *f Hr « availaMc Tor 

androgen^LEdent -? J ,he ? u rH"rr»mily are rat 
gen dep^dent epididymal pro i*i n (PIR code 



important «trnctur ft | felttoL^I i °T ° ther 
tl*al }»» general the mutatin^f" r hM been ^"'ed 
proteins are nof p * ^ ? I<,n B eones »di„g for 

that 6 l,; r t/„; " r °^™ « 1083,, e ugg CStillg 

'o«ti„ fi distant Xi'" " 0l bC 0 P UmaI for 
I« ih«examp.e 6 beIow , We ^ ^ ^ 
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Table 5 

Three MSPs representing distant relationships, from searches of the PIR protein
sequence database (release 26.0) with human apolipoprotein D precursor (PIR code LPHUD)









Optimal PAM-250 


QptlDl&L PAM-1ZO 


PXR cod* 


Optimal FAH-250 alignment 




score (bits) 


a core ( bit a) 


LP HUE 


25 LGKCP^PPVQENrovWKTLGRWYfil 










12 IAACT£ fiAVVKD FD I £Kf*LGFWY£ I 


36 




33*5 




27 HDTVQP KFOQD KFLGRWY 


44 


as. 7 


33-5 


HCHU 




47 


53.0 


30,5 


Highest chance *»l±«ymxL^nt. scots: 




27*0 






code of sequence involved: 




500758 


E0O7SP 



LPHUD, human apolipoprotein D precursor; SQRTAD, rat androgen-dependent epididymal 18.5 K
protein precursor; A32202, rat prostaglandin-D synthase; HCHU, human α1-microglobulin/
inter-α-trypsin inhibitor precursor; S00758, human surface glycoprotein CD16 precursor.



SQRTAD; Brooks et al., 1986), rat prostaglandin-D
synthase (PIR code A32202; Urade et al., 1989) and
human α1-microglobulin (PIR code HCHU;
Kaumeyer et al., 1986). The second of these has only
recently been recognized as a member of the
superfamily (M. S. Boguski & M. C. Peitsch, personal
communication); it is the first such member with
known catalytic activity (Urade et al., 1989).

Using PAM-250 scores, the maximal segment pair
for each of these sequences when compared to
LPHUD is shown in Table 5. These local similarities
correspond to one or two motifs that are conserved
throughout the superfamily (Boguski & States,
1990). The scores for the three alignments are 27.0,
25.7 and 23.0 bits, respectively. However, the
highest score from a protein in the database unrelated to
LPHUD is 27.0 bits, involving human surface
glycoprotein CD16 precursor (PIR code S00758;
Simmons & Seed, 1988). The PAM-250 matrix
therefore fails to separate the homologous
alignments shown from background noise. In contrast,
using the PAM-120 matrix of Table 4, the scores for
the three alignments jump to 33.5, 33.5 and 30.5
bits, respectively. (The 1st 7 alignment positions for
LPHUD-SQRTAD shown in Table 5 are dropped in
an optimal PAM-120 alignment, as are the 1st 3
positions for the LPHUD-A32202 alignment.) This
raises their scores above that of the best chance
PAM-120 alignment (29.0 bits), again involving
human surface glycoprotein CD16 precursor. Notice
that in both cases the estimate that about 30 bits
are needed clearly to distinguish an MSP from
chance is valid. For this query sequence, no
relationship is found using the PAM-250 matrix
that is missed by the PAM-120.



(b) Human α1B-glycoprotein

We searched the PIR database with human
α1B-glycoprotein (PIR code OMHU1B; Ishioka et
al., 1986), a plasma glycoprotein of unknown
function, and a member of the immunoglobulin
superfamily. Using the PAM-250 matrix, the only protein
in the database with an MSP that rises above
background noise is pig Po2 F protein (PIR code
PI.0030; Van de Weghe et al., 1988), which achieves
a score of 32.3 bits. As shown in Table 6, the score
for this known homology (Van de Weghe et al.,
1988) rises to 45.0 bits when the PAM-120 matrix is
used instead. In addition, two proteins with
immunoglobulin domains, kinase-related
transforming protein precursor (PIR code S00474; Qiu et al.,
1988) and human Ig κ chain precursor V-III
region (PIR code K3HUVH; Pech & Zachau, 1984),
achieve scores of 2(K) and fl$-5 bits, respectively.
Table 6 illustrates that both these similarities are
only just distinguishable from chance, and that
using the PAM-250 matrix both similarities drop in
score by at least four bits.

(c) The cystic fibrosis transmembrane
conductance regulator

The cause of cystic fibrosis has been traced to
mutations in a protein that bears striking similarity
to many proteins involved in the transport of
substances across the cell membrane (PIR code
A30300; Riordan et al., 1989). Characteristic
features of the protein are two nucleotide
(ATP)-binding folds (Higgins et al., 1986). When the PIR
database is searched with A30300, many related
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Table 6 

Three MSPs representing distant relationships, from searches of the PIR protein sequence database (release
26.0) with human α1B-glycoprotein (PIR code OMHU1B)



- 


Of>tli»»i PAH -2 50 iii<jntwnr 


Opt In A PAH- 250 
■coc* < biu) 

























2U 




£00474 It 


ff IHF AOS Ela VEAGPTLSfcTCJlJp 




29.0 


K3HUVH 19 




46 22. O 






tMghant- ctiAAc* *ll.gnf»*AC flaw*; 








coo* o£ a«qu*kw* mvoXvvcli 


JQO10Z 





OMJU'IU. human cc ( lJ-giy*0|in>l«lm rUMUW. pig Fo2 r protein: S*OQ474, kin&wrrl&led irnnalbrmU^ protein fVil? Jrtrrwsof; 
K3HWH. hum*n Xg k *h»in precursor v-»| r*gi°n {Vh): J 00102. eggfilftiK muMic vim* RttA Kftlinw (Oaor*Q»K**s* rt at.. i»Mm : 
"CfiMHH. Strwptom^g Aygrainjfcin U yihosf»holr*nsrera*e (%il*r*in e/ «/.. 



proteins may be identified easily using either the
PAM-250 or the PAM-120 substitution matrix.
However, several distant relationships present are
harder to detect. In Table 7 are shown four optimal
PAM-250 alignments, representing homologies to
each of the two A30300 nucleotide-binding folds.
None of these alignments has a PAM-250 score as
great as the highest chance score of 31.3 bits. In
contrast, when the PAM-120 matrix is used, the
alignments jump in score by 4 to almost 12 bits,
giving all but one a score greater than the highest
chance PAM-120 score of 33.0 bits. (The boundaries
of the optimal alignment change slightly under the
alternate scoring scheme.) No biologically
significant similarity is distinguished by the PAM-250
matrix that is not found using the PAM-120. The
relatively high chance scores found in this example
are partly attributable to the length of the query


Table 7

Four MSPs representing distant relationships, from searches of the PIR protein sequence database (release
26.0) with cystic fibrosis transmembrane conductance regulator (PIR code A30300)



riR dad* 



Optimal FAM-^p «lignln*nt 



Optimal »ah~z&o optimal r*H-13Q 
■QOE* (bin*) peor* (bita) 



24.1 



40.0 
35-0 



77 



A34416 



35,0 



»> IRijfcour pf W„ A3iHlH, I Aonryi •Q-l-prulrin svnlh^Uw (frafmcnl) Mohwioin tf aC V»m. 
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# AM V L L ri 

V M fc G HA 

$OCIJ4 10"* »^WMT*I^ArWLHI^tCJUC«SCSrNQKrW.^Wt«SIVQXVT, HZ 

Figure 1. The PAM-250 maximal segment pair for broad
bean leghemoglobin I (PIR code GPVF) and sea
cucumber hemoglobin I (PIR code S06134). Identical
residues are echoed on the central line. PAM-250 score,
25.3 bits; length, 92 residues.



sequence (1480 residues), and partly to its
composition, which renders the parameter λ slightly smaller
than in the previous examples.

(d) Globins

It is possible to find examples of long alignments,
representing distant relationships, that are better
distinguished by the PAM-250 than by the
PAM-120 matrix. In practice such examples are
rare, for some of the reasons discussed above. The
globins are one superfamily in which sequence
divergence has been relatively uniform over the length of
entire proteins. As a result, some sequence
relationships within this superfamily become apparent only
with scoring systems tailored for long but very weak
alignments.

For example, searching the PIR database with
broad bean leghemoglobin I (PIR code GPVF;
Richardson et al., 1975), the alignment with sea
cucumber hemoglobin I (PIR code S06134; Suzuki,
1989), shown in Figure 1, is found having a
PAM-250 score of 25.3 bits. This is almost as high as
the score of the best chance MSP (26.7 bits), which
involves Salmonella typhimurium cystathionine
β-lyase (PIR code JV0020; Park & Stauffer, 1989).
The alignment is 92 residue pairs long; only 14 of
these pairs involve identical amino acid residues,
and they are spread fairly evenly along the
alignment. This particular similarity is totally obscured
when PAM-120 scores are used. The best region of
the alignment shown then involves residues 100 to
133 of the leghemoglobin sequence and has a score
of only 13 bits, while the best chance PAM-120
alignment, involving mouse hepatitis virus E1
membrane glycoprotein (PIR code VGIHKl;
Armstrong et al., 1984), scores 27.5 bits.

Nevertheless, as in the previous examples, a number
of relationships are distinguished by the PAM-120
matrix but missed by the PAM-250.

9. Conclusion

This paper has analyzed the properties of amino
acid substitution matrices in the context of local
alignments lacking gaps. This is exactly the sort of
alignment sought by the recently developed BLAST
database search programs (Altschul et al., 1990;
Altschul & Lipman, 1990). We have concluded that
for protein databases of typical current size (about
1 × 10^7 residues), the most broadly sensitive
substitution matrix should be a log-odds matrix with
relative entropy of about one bit, e.g. the PAM-120
matrix. In order to detect short but strong
homologies or long but weak ones, this matrix can be
complemented by the PAM-40 and PAM-250
matrices; additional matrices should be of only
marginal utility. Of course, many database search
methods, such as the FASTA programs (Lipman &
Pearson, 1985; Pearson & Lipman, 1988), seek local
alignments with gaps, and such measures are
potentially more sensitive to distant homologies.
Unfortunately, if gaps with associated scores are
allowed, the specific quantitative discussion above is
no longer correct. Nevertheless, the general thrust
of the arguments should still apply, and theory and
experiment suggest that analogous results will hold
for local alignments with gaps (Smith et al., 1985;
Waterman et al., 1987; Collins et al., 1988).

There are, of course, many much more involved
ways for assessing local alignments than those
discussed here. Scores can be assigned to aligned
di-residues or tri-residues; they can depend on
alignment length (Altschul & Erickson, 1986); or they can
be complex combinations of various scoring
methods (Argos, 1987). Protein databases may also
be searched with position-dependent scores or
"profiles" constructed from multiple alignments
(Taylor, 1986; Gribskov et al., 1987; Patthy, 1987).
In certain contexts such systems may well be more
sensitive than the straightforward local scoring
system considered here. Two advantages of simple
additive scores are their amenability to powerful
algorithmic methods (Altschul et al., 1990) and to
rigorous statistical analysis (Karlin & Altschul,
1990; Karlin et al., 1990). Such analysis may also
yield insight into the properties of more complicated
scoring schemes.

The author thanks Drs David Lipman, Mark Boguski
and Andrew McLachlan for helpful conversations and
suggestions on the manuscript.



References 

Altschul, S. F. & Erickson, B. W. (1986). A nonlinear
measure of subalignment similarity and its
significance levels. Bull. Math. Biol. 48, 617-632.

Altschul, S. F. & Lipman, D. J. (1990). Protein database
searches for multiple alignments. Proc. Nat. Acad.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. &
Lipman, D. J. (1990). Basic local alignment search
tool. J. Mol. Biol. 215, 403-410.

Argos, P. (1987). A sensitive procedure to compare amino
acid sequences. J. Mol. Biol. 193, 385-396.

Armstrong, J., Niemann, H., Smeekens, S., Rottier, P. &
Warren, G. (1984). Sequence and topology of a model
intracellular membrane protein, E1 glycoprotein,
from a coronavirus. Nature (London), 308, 751-752.

Arratia, R. & Waterman, M. S. (1989). The Erdős-Rényi
strong law for pattern matching with a given
proportion of mismatches. Ann. Prob. 17, 1152-1169.



Received from < 41 S 398 3250 > at 1 1/8101 5:38:26 PM [Eastern Standard Time] 




i NOV. 8.2001 2:30PM ""FLEHR HOHBACH TEST 



1 



NO. 9026 P. 11/26. 



564 



JS. F. AHsthid 



ArrAtiH. R,.. fiordon, lr. & Waterman. >J S. (108(1). An 
extreme valu? theorv fur aeqiifrnre matrhing. Ann. 
14. A? 1-903. 

Arrtttfa, R_. Marrfe. 1', & Walerrnaiu M. «. fltttfti). 
Aturhaetir wrahble: large deviation* for *?(|Ufu<-t^ 
uith sriirea. ./. /tp/if. /'row. 25. MO. 
Boguski* M, «, t Sutw, I>. J. (1990), Molecular rojuniH' 
database*; and their Uws. to l*ratrin fCnyinrrrinyj A 
i*rafti*at Approach (Keea. A. ft.. lVetr.pl, H. A 
Sternberg. 3lf. J. E.. wl»c). eh*th S- l*" 1 **!**. Oxfctr^l. 

Hrooka. D. E M Means. A. R.. Wright. E. J., Singh. 5. P. & 
TIvm. K, K. (1986). Molecular cloning of (he cDXA 
Tor t*o major androgen -dependent oecretory protefnn 
of 18-5 kttod&Itortfl wynthosited by the r»i epididymis. 
Biol. Chem. 261, 4056-4901. 

Oollrna. J. P.. Coupon, A. F, W. & Lyall. A. (MWH|. Th* 
migniriraiH'T t»f protein reqwrif* tiiiniUritk*}*- ('wtiput. 
Appt. Ditnri. 4. 87-71, 

fWtan. J. W.. Matron. IV 4: AUnU. I>. I>. HlWi) . f/ttt(' 
and penra for iroh(HI l-ferri'throim* tranwjKPPt 

into Evrh+richiti c<*K J. HnrlrrhA. 169. 

M4*. 

CToH-mn. K* U r .. Xewcomer. M. & A ./ones. T. A. tl*>£K1|, 
(>y«ta|)ogra|iuu' rehnemrnt of human iterwtt r* i 1»inl 
binding protein at 2 A wwtluiuw, f'rvttin*. 8. 44^<it. 

Dahl. M. Kr. Fraiim*. E.. Saurin. \V,. Boox. \\\. Murium. 
M. l>. 4 Hofnung. M. (1*89 J- Comparison of 
isequenwK from flw malB region*; of Hatnwnrffa 
typhimuriuni and Entrrohtifirr nt>rtnj**i\M with 
fSnekerirhfa fo/f K|2; a potential now rrKuhtlury wit* 
id the iuternjwronit- regitm, M<tf. ftrn. tojiw. 216. 

Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A
model of evolutionary change in proteins. In Atlas of
Protein Sequence and Structure (Dayhoff, M. O., ed.),
vol. 5, suppl. 3, pp. 345-352, Nat. Biomed. Res.
Found., Washington, DC.

E>cmbo, A. i Karlim S. (1&9I). Strong limit lawa of 
«IU|^rical ryntiionJiJ* for Jarga enteedtnres O?" partial 
Bums ofLl.D. variably. Ann. Frob. In th* prtws, 

Drayua. T., Mt Lean, \V„ Wioh. K. L.. Tivnt. J. M.. 
Drahkin. H. A. & l-awii. R. M. (»0H7). Human 

hH'dH£At(oii. and Jummlopy t« tlir Ji.-^lolMUin huih*i'- 
ftmily. »99^(M. 

F*rnp. O. F„ JohnMin. M, Dcmfrttlr. R, P. ((»«,■>). 

Aligning fuitimt »t-*U KeifiiMioeK miki|iarl^tn »F 
eotntnonJy used inKht«ix. J. Mut. A\vV, 21, 1 

Uoftd. \V. tt. & KMiHhisa. M. 1. (IWHi?). J»aumi 
rct-nfcntf iuii In nuclpic- ai-id c»>q4ieiiM*K. I. A p»twral 
im-thttd Terr hudm^ Kh'rI hftutoluperi #tu\ *\a\mr\ri**. 
Xutl* Arid" flrw. ID. ^47-^3- 

Crib^kov. M.. McLiihlrtii. A. I), A fcimMitHMy. O. (HifctT). 
I'rcifile niiAlynia; detrition Of dialarttty Matifl 
l»rol<-ins. Proc. Arnrf, Sri, t C.N. A. 84 r JU.Vt- 

(lWS.'i). Xurleotide binding by ">vinbrnne cnni|knif 

<tf b*cU»riA( |i?rl|i1aamir binding; ^r{ttvm-«(p]H k mlf-nt 

irhri5|K>rt aystnna. EM HO J. 4. 1039-1039* 
Higgin*. C- P., Hi1«b. I. D., 55almond, O. P., OilL U. R„ 

I>o*-nie. J. A., Evann, I, J., Holland. I. K.* Oray. t.. 

But-kd. S. D.. Hell. A. VV, & Hcrmod*oh t M. A. 
A f*mi)y of related ATP-binding gubuniu 

mu|ilt^J to many distlnrt biologietl prpresw* in 

bacteria. Katun (tendon). 323. 
Holmqui&t. JR.. Goodman- M., Conroy, T. & CwUisnirtk, J. 

|liJH3). The spatial diatrrbution of Ax«d muUlions 



wiihiri gene* coding for protein*. J. Mot. EwK 19. 

Hu4ain. I,. Van Hnuten, U. t Thomaa. D. C. A Ranrar. A r 
(19S0). 6equfrno«o of KnJuricfiia coti uvrA gene and 
prale'm tv^aal two potential ATP binding tium, 
/. AloT. CAcm, 261. 4995-4901. 

Ishiqka. N' r . TakahMhi, N\ i Putnam. F. (1986). 
Amino acid icquence of human plajm* * tB-glyto- 
proltin: homology Uk the imrYtUnoglobulin au[Krrgen« 
family. Proc. -Vo<. ^cod, Acf.. U.S.A* 23fl3-23o7. 

Johnson, T. Hruaka, K. S. & Adam*. L. F. 

The nucfeoUde sequence of the fxce£ gene of Pitno 
Adrif^t and a comparison of the ammo acid sequences 
of the acyl-proUfin aynthetawe from Aarveyi and 
r.yt^Aen". Biochtm. Efophyf, /fcj, Commun, 1^3 T 

Karlin, S. & Altschul, S. F. (1990). Methods for assessing
the statistical significance of molecular sequence
features by using general scoring schemes. Proc. Nat.
Acad. Sci., U.S.A. 87, 2264-2268.

Karlin, S., Dembo, A. & Kawabata, T. (1990). Statistical
composition of high-scoring segments from molecular
sequences. Ann. Stat. 18, 571-581.

Kaumevcr. J. F„ tMUtti. J. 0. A Kottek, M. P r (1980^ 
The ftiRXA for a proUinaae inhibitor related to the 
HI-3G domain of intcr-x-tryp«in inhibitor abo 
enrodes a- f -microglobulin (protein HC). Xuitf. Adda 
/to. 14. 7e39-?ftS0. 

Lipman, D. J. & Pearson, W. R. (1985). Rapid and
sensitive protein similarity searches. Science, 227,
1435-1441.

McLachlan, A. D. (1971). Tests for comparing related
amino acid sequences. Cytochrome c and cytochrome
c551. J. Mol. Biol. 61, 409-424.

Needleman, S. B. & Wunsch, C. D. (1970). A general
method applicable to the search for similarities in the
amino acid sequences of two proteins. J. Mol. Biol.
48, 443-453.

0?iorio-Ke*M5, M. Keeafe. P. & Gibb*. A. (19»»K 
Xuctaotide sequence of the genum* of eggplant 
moitaie iynmvlrus. ritofpjn/, 172. 547-554. 

I'ark. V. M r & Siauffer, G. \\ (I089J. 1>XA sequence of the 
metC 1 g^ne and iis flanking regions from Sztmontfla 
iffpkimurium XJT2 and bDmolog\* with the Carre- 
s|mnding sequence of Escherichia tali* .V/o/. Otn. 
titntt. 216. 

Patthy, L. (1987). Detecting homology of distantly related
proteins with consensus sequences. J. Mol. Biol. 198,
567-577.

Pearson, W. R. & Lipman, D. J. (1988). Improved tools
for biological sequence comparison. Proc. Nat. Acad.
Sci., U.S.A. 85, 2444-2448.

Pech, M. & Zachau, H. G. (1984). Immunoglobulin genes
of different subgroups are interdigitated within the
VK locus. Nucl. Acids Res. 12, 9229-9236.

IVitiich. AI C\ & fiogu&ki. M. S. (lf»90). le apolipoprotein 
D a mamnnaiiati bilih-bihding protein? *Vcu? tiialogivt. 
2. IOT-206. 

^iu, F» Ray. P.. Brown, K.. Darker, f*. E. ? Jhanwar. S„ 
Ruddle. F. H. & ttewner. P. (IflSS). Primary 
tttrurture of c-kil: relationship with the CSF-I/PDGF 
reeepior kinase famfly*oneogenie acu'vution of v-kii 
in^olveA deletion of extracellular domain and C 
l^rminut. JCMBQ J. 7. lOud 1011, 

Kajkoviv. A., fJimoneen, J. Davis. R. E. & KottmatK 
F r M. (1989). Molecular etoning and acquenee analysis 
of S-hydroxy-S'mclhylgluuo^'Wnxyme A 
Kduetase from the human para«ttr ScJiistotcmui 






manrntm. Pr«. iV<rf. /fcarf, Set., V.J9.A* S6\ £217- 

Kao. J. K. M. (1987). Xew flooring matrix for amino arid 
reiidu* extfl&llgtft b&M*l on rwn'dur ch*r*<-tf Helir 
physical parameter!. Tnt. /. Froinn /f«. 29, 

27«-2Bl. 

RirhftrdMn* ML, Dilworth. M. J. & fcca*(tn, M I>. (1075). 

The ttnitiA acid sequence or lerghafmugSubiu 1 irom 

root nodule* of broad bean (Tine /ofr* L.). FEBS 

Ldltrt, SL 33-37. 
Riordan, J. K.. Koromen*. J. M , fcerem, II, H„ Alun. N- 

Kor,nl*hel T A., finielciftk. X.. Ziefennki, J„ Uolt. R.. 

Plavsic. N M Chou. ,J. Jj.. Drumm. M. lanituszi, 

M, C. Collin*. P. S- & Tsui. L. (l9H9 r . 

Identification or the cyttic fibrosis gene: t-loninji; And 

charfceteri2atinn of complementary DSA, &cun£e t 

245. 1066-1013. 
Risler, J. L., Delorme, M. O., Delacroix, H. & Henaut, A.
(1988). Amino acid substitutions in structurally
related proteins. A pattern recognition approach.
Determination of a new and efficient scoring matrix.
J. Mol. Biol. 204, 1019-1029.

Sankoff, D. & Kruskal, J. B. (1983). Time Warps, String
Edits and Macromolecules: The Theory and Practice of
Sequence Comparison, Addison-Wesley, Reading,
MA.

Schwartz, R. M. & Dayhoff, M. O. (1978). Matrices for
detecting distant relationships. In Atlas of Protein
Sequence and Structure (Dayhoff, M. O., ed.), vol. 5,
suppl. 3, pp. 353-358, Nat. Biomed. Res. Found.,
Washington, DC.

Sellers, P. H. (1974). On the theory and computation of
evolutionary distances. SIAM J. Appl. Math. 26,
787-793.

Sellers, P. H. (1984). Pattern recognition in genetic
sequences by mismatch density. Bull. Math. Biol. 46,
501-514.

Simmons, D. & Seed, B. (1988). The Fcγ receptor of
natural killer cells is a phospholipid-linked membrane
protein. Nature (London), 333.



Smith, T. F. & Waterman, M. S. (1981). Identification of
common molecular subsequences. J. Mol. Biol. 147,
195-197.

Smith, T. F., Waterman, M. S. & Burks, C. (1985). The
statistical distribution of nucleic acid similarities.
Nucl. Acids Res. 13, 645-656.

Stormo, G. D. & Hartzell, G. W., III (1989). Identifying
protein-binding sites from unaligned DNA fragments.
Proc. Nat. Acad. Sci., U.S.A. 86, 1183-1187.

Suzuki, T. (1989). Amino acid sequence of a major globin
from the sea cucumber Paracaudina chilensis.
Biochim. Biophys. Acta, 998, 292-296.

Taylor, W. R. (1986). Identification of protein sequence
homology by consensus template alignment. J.

Urade, Y., Nagata, A., Suzuki, Y. & Hayaishi, O. (1989).
Primary structure of rat brain prostaglandin D
synthetase deduced from cDNA sequence. J. Biol.
Chem. 264.

Uzzell, T. & Corbin, K. W. (1971). Fitting discrete
probability distributions to evolutionary events.
Science, 172, 1089-1096.

Van de Weghe, A., Coppieters, W., Bauw, G.,
Vandekerckhove, J. & Bouquet, Y. (1988). The
homology between the serum proteins PO2 in pig, Xk
in horse and α1B-glycoprotein in human. Comp.
Biochem. Physiol. 90B, 751-756.

Waterman, M. S., Gordon, L. & Arratia, R. (1987). Phase
transitions in sequence matches and nucleic acid

Wilbur, W. J. (1985). On the PAM matrix model of
protein evolution. Mol. Biol. Evol. 2, 434-447.

Zalacain, M., Gonzalez, A., Guerrero, M. C., Mattaliano,
R. J., Malpartida, F. & Jimenez, A. (1986).
Nucleotide sequence of hygromycin B phospho-
transferase from Streptomyces hygroscopicus.



Edited by F. E. Cohen




Journal of Molecular Biology

Volume 215, Number 3
Contents

Communications

Preliminary Crystallographic Analysis of Trypanothione Reductase from Crithidia
J. Kuriyan, L. Wong, B. D. Guenther, N. J. Murgolo, A. Cerami and G. B. Henderson, 335-337

Narbonin, a 2 S Globulin from Vicia narbonensis L. Crystallization and Preliminary Crystallographic Data
M. Hennig, B. Schlesier, S. Pfeffer and W. E. Höhne, 339-340

Trigonal Crystals of Porcine Mitochondrial Aspartate Aminotransferase
T. Izard, B. Fol, R. A. Pauptit and J. N. Jansonius, 341-344

Articles

Mutational Analysis of Conserved Nucleotides in a Self-splicing Group I Intron
S. Couture, A. D. Ellington, A. S. Gerber, J. M. Cherry, J. A. Doudna, R. Green, M. Hanna, U. Pace, J. Rajagopal and J. W. Szostak

Novel Mutations that Alter the Regulation of Sporulation in Bacillus subtilis. Evidence that Phosphorylation of Regulatory Protein Spo0A Controls the Initiation of Sporulation
G. Olmedo, E. G. Ninfa, J. Stock and P. Youngman

Order-Disorder Phenomena in Myelinated Nerve Sheaths. I. A Physical Model and its Parametrization: Exact and Approximate Determination of the Parameters
V. Luzzati and L. Mateu

Order-Disorder Phenomena in Myelinated Nerve Sheaths. II. The Structure of Myelin in Native and Swollen Rat Sciatic Nerves and in the Course of Myelinogenesis
L. Mateu, V. Luzzati, R. Vargas, E. Vonasek and M. Borgo

Basic Local Alignment Search Tool
S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman

Solution Conformation of Purine-pyrimidine DNA Octamers using Nuclear Magnetic Resonance, Restrained Molecular Dynamics and NOE-based Refinement
J. D. Baleja, M. W. Germann, J. H. van de Sande and B. D. Sykes

Three-dimensional Electron Diffraction of PhoE
P. J. Walian and B. K. Jap

Temperature Dependence of Dynamics of Hydrated Myoglobin. Comparison of Force Field Calculations with Neutron Scattering
R. J. Loncharich and B. R. Brooks

Hydrogen Bond Stereochemistry in Protein Structure and Function
J. A. Ippolito, R. S. Alexander and D. W. Christianson

Erratum, 473
Author Index, 475


